Abstract

Follow-the-Leader (FTL) is an intuitive sequential prediction strategy that guarantees con- 
stant regret in the stochastic setting, but has terrible performance for worst-case data. 
Other hedging strategies have better worst-case guarantees but may perform much worse 
than FTL if the data are not maximally adversarial. We introduce the FlipFlop algorithm, 
which is the first method that provably combines the best of both worlds. 

As part of our construction, we develop AdaHedge, which is a new way of dynamically
tuning the learning rate in Hedge without using the doubling trick. AdaHedge refines a
method by Cesa-Bianchi, Mansour, and Stoltz (2007), yielding slightly improved worst-case
guarantees.



By interleaving AdaHedge and FTL, the FlipFlop algorithm achieves regret within a 
constant factor of the FTL regret, without sacrificing AdaHedge's worst-case guarantees. 

AdaHedge and FlipFlop do not need to know the range of the losses in advance; more- 
over, unlike earlier methods, both have the intuitive property that the issued weights are 
invariant under rescaling and translation of the losses. The losses are also allowed to be 
negative, in which case they may be interpreted as gains. 



Keywords: Hedge, Learning Rate, Mixability, Online learning, Prediction with Expert 
Advice 
1. Introduction

We consider sequential prediction in the general framework of Decision Theoretic Online
Learning (DTOL) or "the Hedge setting" (Freund and Schapire, 1997), which is a variant of
"prediction with expert advice" (Vovk, 1998). Our goal is to develop a sequential prediction
algorithm that performs well not only on adversarial data, which is the scenario most studies
worry about, but also when the data are easy, as is often the case in practice. Specifically,
with adversarial data, the worst-case regret (defined below) for any algorithm is $\Omega(\sqrt{T})$,
where $T$ is the number of predictions to be made. Algorithms such as Hedge, which have
been designed to achieve this lower bound, typically continue to suffer regret of order $\sqrt{T}$,
even for easy data, where the regret of the more intuitive but less robust Follow-the-Leader
(FTL) algorithm (also defined below) is bounded. Here, we present the first algorithm which,
up to constant factors, provably achieves both the regret lower bound in the worst case,
and a regret not exceeding that of FTL. Below, we first describe the Hedge setting. Then
we introduce FTL, discuss sophisticated versions of Hedge from the literature, and give an
overview of the results and contents of this paper.

1.1 Overview 

In the Hedge setting, a learner has to decide each round $t = 1, 2, \ldots$ on a weight
vector $\boldsymbol{w}_t = (w_{t,1}, \ldots, w_{t,K})$ over $K$ "experts". (This term derives from the strongly
related prediction with expert advice paradigm (Littlestone and Warmuth, 1994; Vovk, 1998;
Cesa-Bianchi and Lugosi, 2006).) Nature then reveals a $K$-dimensional vector containing
the losses of the experts $\boldsymbol{\ell}_t = (\ell_{t,1}, \ldots, \ell_{t,K}) \in \mathbb{R}^K$. Learner's loss is the dot product
$h_t = \boldsymbol{w}_t \cdot \boldsymbol{\ell}_t$, which can be interpreted as the expected loss if Learner uses a mixed strategy
and chooses expert $k$ with probability $w_{t,k}$. We denote cumulative versions of a quantity by
capital letters, and vectors are in bold face. Thus, $L_{T,k} = \sum_{t=1}^{T} \ell_{t,k}$ denotes the cumulative
loss of expert $k$ up to the present round $T$, and $H_T = \sum_{t=1}^{T} h_t$ is Learner's cumulative loss
(the "Hedge loss").

Learner's performance is evaluated in terms of her regret, which is the difference between 
her cumulative loss and the cumulative loss of the best expert: 

$$\mathcal{R}_T = H_T - L^*_T, \quad \text{where } L^*_T = \min_k L_{T,k}.$$

A simple and intuitive strategy for the Hedge setting is Follow-the-Leader (FTL), which
puts all weight on the expert(s) with the smallest loss so far. More precisely, we will define
the weights $\boldsymbol{w}_t$ for FTL to be uniform on the set of leaders $\{k \mid L_{t-1,k} = L^*_{t-1}\}$, which
is often just a singleton. FTL works very well under many circumstances, for example in
stochastic scenarios where the losses are independent and identically distributed (i.i.d.). In
particular, the regret for Follow-the-Leader is bounded by the number of times the leader
is overtaken by another expert (Lemma 9), which in the i.i.d. case almost surely happens
only a finite number of times (by the uniform law of large numbers), provided the mean loss
of the best expert is smaller than the mean loss of the other experts. As demonstrated by
the experiments in Section 5, many more sophisticated algorithms can perform significantly
worse than FTL.

The problem with FTL is that it breaks down badly when the data are antagonistic. For
example, if one out of two experts incurs losses $\tfrac12, 0, 1, 0, \ldots$ while the other incurs opposite
losses $0, 1, 0, 1, \ldots$, the regret for FTL is about $T/2$ (this scenario is further discussed in
Section 5.1). This has prompted the development of a multitude of alternative algorithms
that provide better worst-case regret guarantees.
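The breakdown of FTL on this sequence can be checked with a short simulation. The following Python sketch is illustrative only: `ftl_regret` and `alternating_losses` are hypothetical helper names introduced here, not part of the paper, and ties are split uniformly over the current leaders as in the definition above.

```python
def ftl_regret(losses):
    """Regret of Follow-the-Leader with uniform tie-breaking.

    losses: list of per-round loss vectors, one entry per expert.
    """
    K = len(losses[0])
    cum = [0.0] * K      # cumulative loss of each expert so far
    learner = 0.0        # cumulative loss of the learner
    for loss in losses:
        best = min(cum)
        leaders = [k for k in range(K) if cum[k] == best]
        # FTL spreads its weight uniformly over the current leaders.
        learner += sum(loss[k] for k in leaders) / len(leaders)
        cum = [c + l for c, l in zip(cum, loss)]
    return learner - min(cum)


def alternating_losses(T):
    """Expert 0 incurs 1/2, 0, 1, 0, ...; expert 1 incurs 0, 1, 0, 1, ..."""
    rounds = [[0.5, 0.0]]
    for t in range(1, T):
        rounds.append([0.0, 1.0] if t % 2 == 1 else [1.0, 0.0])
    return rounds


if __name__ == "__main__":
    for T in (10, 100, 1000):
        print(T, ftl_regret(alternating_losses(T)))
```

After the first round the leader always turns out to be the expert that suffers loss 1 next, so the learner loses 1 per round while the best expert loses about 1/2 per round on average.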

The seminal strategy for the learner is called Hedge (Freund and Schapire, 1997, 1999).
Its performance crucially depends on a parameter $\eta$ called the learning rate. Hedge can be
interpreted as a generalisation of FTL, which is recovered in the limit for $\eta \to \infty$. In many
analyses, the learning rate is changed from infinity to a lower value that optimizes some
upper bound on the regret. Doing so requires precognition of the number of rounds of the
game, or of some property of the data such as the eventual loss of the best expert $L^*_T$. The
simplest way to address this issue is to use the so-called doubling trick: setting a budget on
the relevant statistic, and restarting the algorithm with a double budget when the budget is
depleted (Cesa-Bianchi and Lugosi, 2006; Cesa-Bianchi et al., 1997; Hazan and Kale, 2008);
$\eta$ can then be optimised for each individual block in terms of the budget. Better bounds,
but harder analyses, are typically obtained if the learning rate is adjusted each round based
on previous observations, see e.g. (Cesa-Bianchi and Lugosi, 2006; Auer et al., 2002).

The Hedge strategy presented by Cesa-Bianchi, Mansour, and Stoltz (2007) is very closely
related to the approach described here. The relevant algorithm, which we refer to as CBMS,
is defined in (16) in Section 4.2 of their paper. Its regret satisfies

$$\mathcal{R}^{\mathrm{CBMS}}_T \le 4 \sqrt{\frac{L^*_T (\sigma T - L^*_T)}{T} \ln K} + 39\,\sigma \max\{1, \ln K\}, \tag{1}$$

where $\sigma$ is the range of observed losses; if all losses are nonnegative, this is the maximum
loss attained by any expert at any time. Thus, in the worst case this algorithm has a regret
of order $\sqrt{T}$, but it performs much better when the loss of the best expert is close to
either $0$ or $\sigma T$.

The goal of this work is to develop a strategy that retains this worst-case bound, but
has even better guarantees for easy data: its performance should never be substantially
worse than that of Follow-the-Leader. At first glance, this may seem like a trivial problem:
simply take both FTL and some other hedging strategy with good worst-case guarantees,
and combine the two by using FTL or Hedge recursively. To see why such approaches do
not work, suppose that FTL achieves regret $\mathcal{R}^{\mathrm{ftl}}_T$, while the safe hedging strategy achieves
regret $\mathcal{R}^{\mathrm{safe}}_T$. We would only be able to prove that the regret of the combined strategy
compared to the best original expert satisfies $\mathcal{R}^{c}_T \le \min\{\mathcal{R}^{\mathrm{ftl}}_T, \mathcal{R}^{\mathrm{safe}}_T\} + \mathcal{G}^{c}_T$, where $\mathcal{G}^{c}_T$ is the
worst-case regret guarantee for the combination method, e.g. (1). In general, either $\mathcal{R}^{\mathrm{ftl}}_T$ or
$\mathcal{R}^{\mathrm{safe}}_T$ may be close to zero, while at the same time both algorithms have loss close to $T/2$,
so that $\mathcal{G}^{c}_T = \Omega(\sqrt{T})$. That is, the overhead of the combination method will dominate the
regret!

We address this issue in two stages. First, in Section 2 we develop AdaHedge, which
is a refinement of the CBMS strategy of Cesa-Bianchi et al. (2007) for which we can obtain
similar bounds, including (1), but with a factor 2 improvement of the dominant term
(Theorem 8). Like CBMS, the learning rate is tuned in terms of a direct measure of past
performance. However, AdaHedge not only recovers the "fundamental" regret bounds of
CBMS, but it also has the intuitive property that the weights it issues are themselves invariant
to translation and rescaling of the losses (see Section 4). The analysis of AdaHedge
is also surprisingly clean. A preliminary version of this strategy was presented at NIPS
(Van Erven et al., 2011).

1. The leading constant of 4 was later improved to approximately 2.63 in (Gerchinovitz, 2011, Remark 2.2),
essentially by using Lemma 2 below. Our approach allows a further reduction to 2.



Second, in Section 3 we build on AdaHedge to develop the FlipFlop approach, which
alternates between FTL and AdaHedge. For this strategy we can guarantee

$$\mathcal{R}^{\mathrm{ff}}_T = O\big(\min\{\mathcal{R}^{\mathrm{ftl}}_T, \mathcal{G}^{\mathrm{ah}}_T\}\big),$$

where $\mathcal{G}^{\mathrm{ah}}_T$ is the regret guarantee for AdaHedge; Theorem 14 provides a precise statement.
Thus, FlipFlop is the first algorithm that provably combines the benefits of Follow-the-Leader
with robust behaviour for antagonistic data.

A key concept in the design and analysis of our algorithms is what we call the mixability
gap, introduced in Section 2.1. This quantity also appears in earlier works, and seems to be
of fundamental importance both in the current Hedge setting and in stochastic settings. We
elaborate on this in Section 6.2, where we provide the big picture underlying this research
and briefly indicate how it relates to practical work such as (Devaine et al., 2012).



1.2 Related Work 



As mentioned, AdaHedge is a refinement of the strategy analysed by Cesa-Bianchi et al.
(2007), which is itself more sophisticated than most earlier approaches, with two notable
exceptions. First, by slightly modifying the weights, and tuning the learning rate in terms
of the cumulative empirical variance of the best expert, Hazan and Kale (2008) are able to
obtain a bound that multiplicatively dominates (1). However, their method requires the
doubling trick, and as demonstrated by the experiments in Section 5, it does not achieve
the benefits of FTL. Second, Chaudhuri, Freund and Hsu (2009) describe a strategy called
NormalHedge that can efficiently compete with the best $\epsilon$-quantile of experts; their bound
is incomparable with the bound for AdaHedge. In the experimental section we discuss the
performance of these approaches compared to AdaHedge and FlipFlop.



Other approaches to sequential prediction include defensive forecasting (Vovk et al.,
2005), and Following the Perturbed Leader (Kalai and Vempala, 2003). These radically
different approaches also allow competing with the best $\epsilon$-quantile, see (Chernov and Vovk,
2010) and (Hutter and Poland, 2005); the latter article also considers nonuniform weights
on the experts.

The "safe MDL" and "safe Bayesian" algorithms by Grünwald (2011, 2012) share the
present work's focus on the mixability gap as a crucial part of the analysis, but are concerned
with the stochastic setting where losses are not adversarial but i.i.d. FlipFlop, safe MDL
and safe Bayes can all be interpreted as methods that attempt to choose a learning rate $\eta$
that keeps the mixability gap small (or, equivalently, that keeps the Bayesian posterior or
Hedge weights "concentrated").



1.3 Outline 

In the next section we present and analyse AdaHedge. Then, in Section 3, we build on
AdaHedge to develop the FlipFlop strategy. The analysis closely parallels that of AdaHedge,
but with extra complications at each of the steps. Both algorithms are initially analysed
for normalised losses, which take values in the interval $[0,1]$. In Section 4 we extend their
analysis to unnormalised losses. Then we compare AdaHedge and FlipFlop to existing
methods in experiments with artificial data in Section 5. Finally, Section 6 contains a
discussion, with ambitious suggestions for future work.



2. AdaHedge 

In this section, we present and analyse the AdaHedge strategy. The behaviour of AdaHedge
does not change under scaling or translation of the losses. However, to keep the analysis
simple, we will initially assume throughout this and the next section that all losses are
normalised to the unit interval, i.e. $\boldsymbol{\ell}_t \in [0, 1]^K$. Unnormalised losses are treated in Section 4
by a reduction to the normalised case.

To introduce our notation and proof strategy, we start with the simplest possible analysis 
of vanilla Hedge, and then move on to refine it for AdaHedge. 



2.1 Basic Hedge Analysis for Constant Learning Rate 

Following Freund and Schapire (1997) we define the Hedge or exponential weights strategy
as the choice of weights

$$w_{t,k} = \frac{w_{1,k}\, e^{-\eta L_{t-1,k}}}{Z_t}, \tag{2}$$

where $\boldsymbol{w}_1 = (1/K, \ldots, 1/K)$ is the uniform distribution, $Z_t = \boldsymbol{w}_1 \cdot e^{-\eta \boldsymbol{L}_{t-1}}$ is a normalizing
constant, and $\eta \in (0, \infty)$ is a parameter of the algorithm called the learning rate. If $\eta = 1$
and one imagines $L_{t-1,k}$ to be the negative log-likelihood of a sequence of observations,
then $w_{t,k}$ is the Bayesian posterior probability of expert $k$ and $Z_t$ is the marginal likelihood
of the observations. Consequently, like in Bayesian inference, the weights can be updated
multiplicatively, i.e. we have $w_{t+1,k} \propto w_{t,k}\, e^{-\eta \ell_{t,k}}$.
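To make the equivalence between the closed form (2) and the multiplicative update concrete, here is a minimal Python sketch. The helper names `hedge_weights` and `multiplicative_update` are hypothetical and introduced only for illustration; a finite learning rate and a uniform prior are assumed.

```python
import math


def hedge_weights(cum_losses, eta):
    """Exponential weights (2) with a uniform prior, for finite eta > 0."""
    m = min(cum_losses)  # shifting by the minimum avoids underflow
    unnorm = [math.exp(-eta * (L - m)) for L in cum_losses]
    Z = sum(unnorm)
    return [u / Z for u in unnorm]


def multiplicative_update(weights, loss, eta):
    """One step of w_{t+1,k} proportional to w_{t,k} * exp(-eta * loss_k)."""
    unnorm = [w * math.exp(-eta * l) for w, l in zip(weights, loss)]
    Z = sum(unnorm)
    return [u / Z for u in unnorm]


if __name__ == "__main__":
    eta = 0.7
    losses = [[0.3, 0.9, 0.1], [1.0, 0.0, 0.5], [0.2, 0.2, 0.8]]
    w = [1 / 3] * 3
    cum = [0.0] * 3
    for loss in losses:
        w = multiplicative_update(w, loss, eta)
        cum = [c + l for c, l in zip(cum, loss)]
        # Both routes give the same weights up to rounding.
        print(w, hedge_weights(cum, eta))
```

Subtracting the minimum cumulative loss before exponentiating leaves the normalised weights unchanged, which is the same translation invariance the text attributes to the issued weights.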

The loss incurred by Hedge in round $t$ is $h_t = \boldsymbol{w}_t \cdot \boldsymbol{\ell}_t$, and our goal is to obtain a
good bound on the cumulative Hedge loss $H_T = \sum_{t=1}^{T} h_t$. To this end, it turns out to be
technically convenient to approximate $h_t$ by the mix loss

$$m_t = -\frac{1}{\eta} \ln\big(\boldsymbol{w}_t \cdot e^{-\eta \boldsymbol{\ell}_t}\big), \tag{3}$$

which accumulates to $M_T = \sum_{t=1}^{T} m_t$. This approximation is a standard tool in the literature.
For example, the mix loss $m_t$ corresponds to the loss of Vovk's (1998; 2001)
Aggregating Pseudo Algorithm, and tracking the evolution of $-m_t$ is a crucial ingredient
in the proof of Theorem 2.2 of Cesa-Bianchi and Lugosi (2006).



The definitions of Hedge and the mix loss may both be extended to $\eta = \infty$ by letting $\eta$
tend to $\infty$. In the case of Hedge, we then find that $\boldsymbol{w}_t$ becomes a uniform distribution on the
set of experts $\{k \mid L_{t-1,k} = L^*_{t-1}\}$ that have incurred smallest cumulative loss before time
$t$. That is, Hedge with $\eta = \infty$ reduces to Follow-the-Leader, with ties broken by dividing
the probability mass uniformly. For the mix loss, we find that the limiting case as $\eta$ tends
to $\infty$ is $m_t = L^*_t - L^*_{t-1}$.

In our approximation of the Hedge loss $h_t$ by the mix loss $m_t$, we call the approximation
error $\delta_t = h_t - m_t$ the mixability gap. Bounding this quantity is a standard part of the
analysis of Hedge-type algorithms (see, for example, Lemma 4 of Cesa-Bianchi et al. (2007))
and it also appears to be a fundamental notion in sequential prediction even when only so-called
mixable losses are considered (Grünwald, 2011, 2012); see also Section 6.2. We let
$\Delta_T = \delta_1 + \ldots + \delta_T$ denote the cumulative mixability gap, so that the regret for Hedge may
be decomposed as

$$\mathcal{R}_T = H_T - L^*_T = M_T - L^*_T + \Delta_T. \tag{4}$$

Here $M_T - L^*_T$ may be thought of as the regret under the mix loss and $\Delta_T$ is the cumulative
approximation error when approximating the Hedge loss by the mix loss. Throughout the
paper, our proof strategy will be to analyse these two contributions to the regret, $M_T - L^*_T$
and $\Delta_T$, separately. The following lemma, which is proved in Appendix A, collects a few
basic properties:



Lemma 1 (Mix Loss with Constant Learning Rate) For any learning rate $\eta \in (0, \infty]$:

1. Mix loss is at most Hedge loss ($m_t \le h_t$), so that $\delta_t \ge 0$. Moreover, for losses in the
range $[0, 1]$, we have $m_t \ge 0$ and $h_t \le 1$, so that also $\delta_t \le 1$.

2. Cumulative mix loss telescopes: $M_T = -\frac{1}{\eta} \ln\big(\boldsymbol{w}_1 \cdot e^{-\eta \boldsymbol{L}_T}\big)$.

3. Cumulative mix loss approximates the loss of the best expert: $L^*_T \le M_T \le L^*_T + \frac{\ln K}{\eta}$.

4. The cumulative mix loss $M_T$ is nonincreasing in $\eta$.
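The four properties can be sanity-checked numerically. The sketch below is an illustration under stated assumptions, not a proof: `hedge_run` is a hypothetical helper that runs Hedge with a fixed finite learning rate on losses in $[0,1]$ and returns the cumulative Hedge loss, the cumulative mix loss, the per-round mixability gaps, and the final cumulative loss vector.

```python
import math


def hedge_run(losses, eta):
    """Run Hedge with fixed finite eta > 0.

    Returns (H_T, M_T, gaps, L_T), where gaps[t] = h_t - m_t.
    """
    K = len(losses[0])
    L = [0.0] * K
    H = M = 0.0
    gaps = []
    for loss in losses:
        mn = min(L)
        unnorm = [math.exp(-eta * (x - mn)) for x in L]
        Z = sum(unnorm)
        w = [u / Z for u in unnorm]
        # Hedge loss: expected loss under the weights.
        h = sum(wk * lk for wk, lk in zip(w, loss))
        # Mix loss (3).
        m = -math.log(sum(wk * math.exp(-eta * lk) for wk, lk in zip(w, loss))) / eta
        gaps.append(h - m)
        H += h
        M += m
        L = [a + b for a, b in zip(L, loss)]
    return H, M, gaps, L


if __name__ == "__main__":
    data = [[((7 * t + 3 * k * k + t * k) % 11) / 10.0 for k in range(4)]
            for t in range(200)]
    H, M, gaps, L = hedge_run(data, 0.5)
    print("L* =", min(L), " M =", M, " H =", H)
    print("upper bound on M:", min(L) + math.log(4) / 0.5)
```

The deterministic loss pattern in the usage example is arbitrary; any sequence with values in $[0,1]$ should satisfy the same inequalities up to floating-point rounding.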



In order to obtain a bound for Hedge, one can use the following well-known bound on
the mixability gap, which is obtained using Hoeffding's bound on the cumulant generating
function (Cesa-Bianchi and Lugosi, 2006, Lemma A.1):

$$\delta_t \le \frac{\eta}{8}, \tag{5}$$

from which $\Delta_T \le T\eta/8$. Together with the bound $M_T - L^*_T \le \ln(K)/\eta$ from mix loss
property #3 this leads to

$$\mathcal{R}_T = (M_T - L^*_T) + \Delta_T \le \frac{\ln K}{\eta} + \frac{\eta T}{8}. \tag{6}$$

The bound is optimized for $\eta = \sqrt{8 \ln(K)/T}$, which equalizes the two terms. This leads to
a bound on the regret of $\sqrt{T \ln(K)/2}$, matching the lower bound on worst-case regret from
the textbook by Cesa-Bianchi and Lugosi (2006, Sections 2.2 and 3.7). We can use this
tuned learning rate if the time horizon $T$ is known in advance; to deal with the situation
where it is not, the doubling trick can be used, at the cost of a worse constant factor in
the leading term of the regret bound.
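As a worked check of this tuning, using only the bound (6) and the symbols already defined, setting the derivative of the right-hand side to zero recovers both the learning rate and the resulting regret bound:

```latex
\begin{align*}
\frac{d}{d\eta}\Big(\frac{\ln K}{\eta} + \frac{\eta T}{8}\Big)
  = -\frac{\ln K}{\eta^2} + \frac{T}{8} = 0
  \quad &\Longrightarrow \quad \eta = \sqrt{\frac{8 \ln K}{T}}, \\
\frac{\ln K}{\eta} + \frac{\eta T}{8} \,\bigg|_{\eta = \sqrt{8 \ln K / T}}
  = \sqrt{\frac{T \ln K}{8}} + \sqrt{\frac{T \ln K}{8}}
  &= \sqrt{\frac{T \ln K}{2}}.
\end{align*}
```

Both terms equal $\sqrt{T \ln(K)/8}$ at the optimum, which is the sense in which the tuned learning rate equalizes them.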

In the remainder of this section, we introduce the AdaHedge strategy, and refine the 
steps of the analysis above to obtain a better regret bound. 
2.2 AdaHedge Analysis

In the previous section, we split the regret for Hedge into two parts: $M_T - L^*_T$ and $\Delta_T$,
and we obtained a bound for both. The learning rate $\eta$ was then tuned to equalise these
two bounds. The main distinction between AdaHedge and other Hedge approaches is that
AdaHedge does not consider an upper bound on $\Delta_T$ in order to obtain this balance: instead
it aims to equalize $\Delta_T$ and $\ln(K)/\eta$. As the cumulative mixability gap is monotonically
increasing and can be observed on-line, it is possible to adapt the learning rate directly
based on $\Delta_t$.

Perhaps the easiest way to achieve this is by using the doubling trick: each subsequent
block uses half the learning rate of the previous block, and a new block is started as soon
as the observed cumulative mixability gap $\Delta_t$ exceeds the bound on the mix loss $\ln(K)/\eta$,
which ensures these two quantities are equal at the end of each block. This is the approach
taken in an earlier version of AdaHedge (Van Erven et al., 2011). However, we can achieve
the same goal much more elegantly, by decreasing the learning rate with time as follows:

$$\eta^{\mathrm{ah}}_t = \frac{\ln K}{\Delta^{\mathrm{ah}}_{t-1}}. \tag{7}$$

(Note that $\eta^{\mathrm{ah}}_1 = \infty$.) The definitions (2) and (3) of the weights and the mix loss are
modified to use this new learning rate:

$$w^{\mathrm{ah}}_{t,k} = \frac{w^{\mathrm{ah}}_{1,k}\, e^{-\eta^{\mathrm{ah}}_t L_{t-1,k}}}{\boldsymbol{w}^{\mathrm{ah}}_1 \cdot e^{-\eta^{\mathrm{ah}}_t \boldsymbol{L}_{t-1}}} \quad \text{and} \quad m^{\mathrm{ah}}_t = -\frac{1}{\eta^{\mathrm{ah}}_t} \ln\big(\boldsymbol{w}^{\mathrm{ah}}_t \cdot e^{-\eta^{\mathrm{ah}}_t \boldsymbol{\ell}_t}\big), \tag{8}$$

with $\boldsymbol{w}^{\mathrm{ah}}_1 = (1/K, \ldots, 1/K)$. Note that the multiplicative update rule for the weights no
longer applies when the learning rate varies with $t$; the last three results of Lemma 1 are also
no longer valid. Later we will also consider other algorithms to determine variable learning
rates; to avoid confusion the considered algorithm is always specified in the superscript in
our notation. See Table 1 for reference.

From now on, AdaHedge will be defined as the Hedge algorithm with learning rate
defined by (7). For concreteness, a MATLAB implementation appears in Figure 1.



Our learning rate is similar to that of Cesa-Bianchi et al. (2007), but it is always higher
and as such may exploit easy sequences of losses more aggressively. Moreover, our tuning of
the learning rate simplifies the analysis, leading to tighter results; the essential new technical
ingredients appear as Lemmas 3 and 5 below.

We analyse the regret for AdaHedge like we did in the previous section for a fixed
learning rate: we again consider $M^{\mathrm{ah}}_T - L^*_T$ and $\Delta^{\mathrm{ah}}_T$ separately. This time, both legs of
the analysis become slightly more involved. Luckily, a good bound can still be obtained
with only a small amount of work. First we show that the mix loss is bounded by the
mix loss we would have incurred had we used the final learning rate $\eta^{\mathrm{ah}}_T$ all along
(Kalnishkan and Vyugin, 2005, Lemma 3):

Lemma 2 Let dec be any strategy for choosing the learning rate such that $\eta_1 \ge \eta_2 \ge \ldots$
Then the cumulative mix loss for dec does not exceed the cumulative mix loss for the strategy
that uses the last learning rate $\eta_T$ from the start: $M^{\mathrm{dec}}_T \le M^{(\eta_T)}_T$.
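$\boldsymbol{\ell}_t$    Loss vector for time $t$
$L^*_t = \min_k L_{t,k}$    Cumulative loss of the best expert
$w^{\mathrm{alg}}_{t,k} = e^{-\eta^{\mathrm{alg}}_t L_{t-1,k}} \big/ \sum_i e^{-\eta^{\mathrm{alg}}_t L_{t-1,i}}$    Weights played at time $t$
$h^{\mathrm{alg}}_t = \boldsymbol{w}^{\mathrm{alg}}_t \cdot \boldsymbol{\ell}_t$    Hedge loss
$m^{\mathrm{alg}}_t = -\frac{1}{\eta^{\mathrm{alg}}_t} \ln\big(\boldsymbol{w}^{\mathrm{alg}}_t \cdot e^{-\eta^{\mathrm{alg}}_t \boldsymbol{\ell}_t}\big)$    Mix loss
$\delta^{\mathrm{alg}}_t = h^{\mathrm{alg}}_t - m^{\mathrm{alg}}_t$    Mixability gap
$v^{\mathrm{alg}}_t = \mathrm{Var}_{k \sim \boldsymbol{w}^{\mathrm{alg}}_t}[\ell_{t,k}]$    Loss variance at time $t$
$\mathcal{R}^{\mathrm{alg}}_t = H^{\mathrm{alg}}_t - L^*_t$    Regret at time $t$

A capital letter denotes the cumulative value, e.g. $\Delta^{\mathrm{alg}}_T = \sum_{t=1}^{T} \delta^{\mathrm{alg}}_t$.

The "alg" in the superscript refers to the algorithm that defines the learning rate used
at each time step: "$(\eta)$" represents Hedge with fixed learning rate $\eta$; "ah" denotes
AdaHedge, defined in (7); "ftl" denotes Follow-the-Leader ($\eta^{\mathrm{ftl}} = \infty$), and "ff" denotes
FlipFlop, defined in Section 3.

Table 1: Notation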



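Proof Using mix loss property #4 we have

$$\sum_{t=1}^{T} m^{\mathrm{dec}}_t = \sum_{t=1}^{T} \big(M^{(\eta_t)}_t - M^{(\eta_t)}_{t-1}\big) \le \sum_{t=1}^{T} \big(M^{(\eta_t)}_t - M^{(\eta_{t-1})}_{t-1}\big) = M^{(\eta_T)}_T,$$

which was to be shown.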
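Lemma 2 can be illustrated numerically. The Python sketch below is a hedged illustration rather than part of the paper's analysis: the function names are hypothetical, learning rates are assumed finite, and each round's mix loss is computed through the telescoping identity $m_t = M^{(\eta_t)}_t - M^{(\eta_t)}_{t-1}$ used in the proof.

```python
import math


def cumulative_mix_loss_fixed(L, eta):
    """M^{(eta)} for cumulative losses L: -(1/eta) ln((1/K) sum_k exp(-eta L_k))."""
    mn = min(L)
    s = sum(math.exp(-eta * (x - mn)) for x in L)
    return mn - math.log(s / len(L)) / eta


def cumulative_mix_loss(losses, etas):
    """Cumulative mix loss when round t uses the finite learning rate etas[t]."""
    K = len(losses[0])
    L = [0.0] * K
    total = 0.0
    for loss, eta in zip(losses, etas):
        before = cumulative_mix_loss_fixed(L, eta)
        L = [a + b for a, b in zip(L, loss)]
        # Mix loss of this round under learning rate eta.
        total += cumulative_mix_loss_fixed(L, eta) - before
    return total


if __name__ == "__main__":
    T, K = 300, 5
    data = [[((7 * t + 3 * k * k + t * k) % 11) / 10.0 for k in range(K)]
            for t in range(T)]
    etas = [1.0 / math.sqrt(t + 1) for t in range(T)]  # nonincreasing schedule
    print("decreasing schedule:", cumulative_mix_loss(data, etas))
    print("final rate throughout:", cumulative_mix_loss(data, [etas[-1]] * T))
```

The schedule $1/\sqrt{t}$ is just one example of a nonincreasing sequence; the lemma only requires monotonicity.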



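We can now show that the two contributions to the regret are still balanced.

Lemma 3 The AdaHedge regret is $\mathcal{R}^{\mathrm{ah}}_T = M^{\mathrm{ah}}_T - L^*_T + \Delta^{\mathrm{ah}}_T \le 2\Delta^{\mathrm{ah}}_T$.

Proof As $\delta^{\mathrm{ah}}_t \ge 0$ for all $t$ (by mix loss property #1), the cumulative mixability gap
$\Delta^{\mathrm{ah}}_t$ is nondecreasing. Consequently, the AdaHedge learning rate $\eta^{\mathrm{ah}}_t$ as defined in (7) is
nonincreasing in $t$. Thus Lemma 2 applies to $M^{\mathrm{ah}}_T$; together with mix loss property #3
and (7) this yields

$$M^{\mathrm{ah}}_T \le M^{(\eta^{\mathrm{ah}}_T)}_T \le L^*_T + \frac{\ln K}{\eta^{\mathrm{ah}}_T} = L^*_T + \Delta^{\mathrm{ah}}_{T-1} \le L^*_T + \Delta^{\mathrm{ah}}_T.$$

Substitution into the trivial decomposition $\mathcal{R}^{\mathrm{ah}}_T = M^{\mathrm{ah}}_T - L^*_T + \Delta^{\mathrm{ah}}_T$ yields the result. ■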

The remaining task is to establish a bound on $\Delta^{\mathrm{ah}}_T$. As before, we start with a bound on
the mixability gap in a single round, but rather than (5), we use Bernstein's bound on the
mixability gap in a single round to obtain a result that is expressed in terms of the variance
of the losses, $v^{\mathrm{ah}}_t = \mathrm{Var}_{k \sim \boldsymbol{w}^{\mathrm{ah}}_t}[\ell_{t,k}] = \sum_k w^{\mathrm{ah}}_{t,k} (\ell_{t,k} - h^{\mathrm{ah}}_t)^2$.
% Returns the losses of AdaHedge.
% l(t,k) is the loss of expert k at time t
function h = adahedge(l)
    [T, K] = size(l);
    h = nan(T,1);
    L = zeros(1,K);
    Delta = 0;

    for t = 1:T
        eta = log(K)/Delta;
        [w, Mprev] = mix(eta, L);
        h(t) = w * l(t,:)';
        L = L + l(t,:);
        [~, M] = mix(eta, L);
        delta = max(0, h(t)-(M-Mprev));
        % (max clips numeric Jensen violation)
        Delta = Delta + delta;
    end
end

% Returns the posterior weights and mix loss
% for learning rate eta and cumulative loss
% vector L, avoiding numerical instability.
function [w, M] = mix(eta, L)
    mn = min(L);
    if (eta == Inf) % Limit behaviour: FTL
        w = L==mn;
    else
        w = exp(-eta .* (L-mn));
    end
    s = sum(w);
    w = w / s;
    M = mn - log(s/length(L))/eta;
end

Figure 1: Numerically robust MATLAB implementation of AdaHedge
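For readers without MATLAB, the following is a hedged Python port of Figure 1. It is a sketch that mirrors the figure line by line; the only deliberate deviation is that the infinite initial learning rate is set explicitly, because Python raises an error on division by zero where MATLAB returns `Inf`.

```python
import math


def mix(eta, L):
    """Weights and cumulative mix loss for learning rate eta and cumulative losses L."""
    mn = min(L)
    if math.isinf(eta):
        w = [1.0 if x == mn else 0.0 for x in L]  # limit behaviour: FTL
    else:
        w = [math.exp(-eta * (x - mn)) for x in L]
    s = sum(w)
    w = [x / s for x in w]
    # For eta = inf the mix loss term vanishes and M equals the minimum loss.
    M = mn if math.isinf(eta) else mn - math.log(s / len(L)) / eta
    return w, M


def adahedge(losses):
    """Return the per-round AdaHedge losses h_t for a list of loss vectors."""
    K = len(losses[0])
    L = [0.0] * K
    Delta = 0.0  # cumulative mixability gap so far
    h = []
    for loss in losses:
        eta = math.log(K) / Delta if Delta > 0 else math.inf
        w, M_prev = mix(eta, L)
        h_t = sum(wk * lk for wk, lk in zip(w, loss))
        L = [a + b for a, b in zip(L, loss)]
        _, M = mix(eta, L)
        # max clips small negative values caused by rounding
        delta = max(0.0, h_t - (M - M_prev))
        Delta += delta
        h.append(h_t)
    return h


if __name__ == "__main__":
    T = 1000
    alt = [[0.5, 0.0]] + [[0.0, 1.0] if t % 2 == 1 else [1.0, 0.0]
                          for t in range(1, T)]
    best = min(sum(col) for col in zip(*alt))
    print("AdaHedge regret on alternating losses:", sum(adahedge(alt)) - best)
```

On the alternating two-expert sequence from the introduction, where FTL suffers regret about $T/2$, the regret of this sketch should stay of order $\sqrt{T \ln K}$.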



Lemma 4 (Bernstein's Bound) Let $\eta_t = \eta^{\mathrm{alg}}_t \in (0, \infty)$ denote the finite learning rate
chosen for round $t$ by any algorithm "alg". For losses in the range $[0, 1]$, the mixability gap
$\delta^{\mathrm{alg}}_t$ satisfies

$$\delta^{\mathrm{alg}}_t \le \frac{e^{\eta_t} - \eta_t - 1}{\eta_t}\, v^{\mathrm{alg}}_t. \tag{9}$$

Further, $v^{\mathrm{alg}}_t \le 1/4$.

Proof This is Bernstein's bound (Cesa-Bianchi and Lugosi, 2006, Lemma A.5) on the
cumulant generating function, applied to the random variable $\ell_{t,k}$ with $k$ distributed
according to $\boldsymbol{w}^{\mathrm{alg}}_t$. ■



Bernstein's bound is more sophisticated than (5), because it expresses that the mixability
gap $\delta_t$ is small not only when $\eta_t$ is small, but also when all experts have approximately the
same loss, or when the weights $\boldsymbol{w}_t$ are concentrated on a single expert.

The next step is to use Bernstein's inequality to obtain a bound on the cumulative
mixability gap $\Delta^{\mathrm{ah}}_T$. In the analysis of Cesa-Bianchi et al. (2007) this is achieved by first
applying Bernstein's bound for each individual round, and then using a telescoping argument
to obtain a bound on the sum. With our learning rate (7) it is convenient to reverse these
steps: we first telescope, which can now be done with equality, and subsequently apply a
stricter version of Bernstein's inequality.

Lemma 5 For losses in the range $[0, 1]$, AdaHedge's cumulative mixability gap satisfies

$$(\Delta^{\mathrm{ah}}_T)^2 \le V^{\mathrm{ah}}_T \ln K + \big(1 + \tfrac23 \ln K\big) \Delta^{\mathrm{ah}}_T.$$
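Proof In this proof we will omit the superscript "ah". Using the definition of the learning
rate (7) and $\delta_t \le 1$ (from mix loss property #1), we get

$$\Delta_T^2 = \sum_{t=1}^{T} \big(\Delta_t^2 - \Delta_{t-1}^2\big) = \sum_{t=1}^{T} \big(2 \delta_t \Delta_{t-1} + \delta_t^2\big) = \sum_{t=1}^{T} \Big(2 \delta_t \frac{\ln K}{\eta_t} + \delta_t^2\Big) \le 2 \ln K \sum_{t=1}^{T} \frac{\delta_t}{\eta_t} + \Delta_T. \tag{10}$$

The only inequality in this equation replaces $\delta_t^2$ by $\delta_t$, which is of no concern: the resulting
$\Delta_T$ term adds 2 to the regret bound. We will now show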

$$\frac{\delta_t}{\eta_t} \le \tfrac12 v_t + \tfrac13 \delta_t. \tag{11}$$

This supersedes the bound $\delta_t/\eta_t \le (e-2) v_t$ used by Cesa-Bianchi et al. (2007). Even though
at first sight circular, this form has two major advantages. Inclusion of the overhead $\tfrac13 \delta_t$
will only affect smaller order terms of the regret, but admits a significant reduction of the
leading constant. This gain directly percolates to our regret bounds below. Additionally,
(11) holds for all $\eta_t$, which simplifies tuning considerably.

First note that (11) is clearly valid if $\eta_t = \infty$. Assuming that $\eta_t$ is finite, we can obtain
this result by rewriting Bernstein's bound (9) as follows:

$$\tfrac12 v_t \ge \delta_t \cdot \frac{\eta_t}{2e^{\eta_t} - 2\eta_t - 2} = \frac{\delta_t}{\eta_t} - f(\eta_t)\,\delta_t, \quad \text{where } f(x) = \frac{e^x - \tfrac12 x^2 - x - 1}{x e^x - x^2 - x}.$$

It remains to show that $f(x) \le 1/3$ for all $x > 0$. After rearranging, we find this to be the
case if

$$(3 - x) e^x \le \tfrac12 x^2 + 2x + 3.$$

Taylor expansion of the left-hand side around zero reveals that $(3 - x) e^x = \tfrac12 x^2 + 2x + 3 - \tfrac16 x^3 u e^u$
for some $0 \le u \le x$, from which the result follows. The proof is completed by
plugging (11) into (10). ■
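The key inequality $f(x) \le 1/3$ is easy to probe numerically. The following Python sketch is only a sanity check on a grid, not a substitute for the Taylor argument; it writes $f$ using `math.expm1` to limit cancellation for small $x$.

```python
import math


def f(x):
    """f(x) = (e^x - x^2/2 - x - 1) / (x e^x - x^2 - x), for x > 0."""
    base = math.expm1(x) - x        # e^x - x - 1
    num = base - 0.5 * x * x        # e^x - x^2/2 - x - 1
    den = x * base                  # x e^x - x^2 - x
    return num / den


if __name__ == "__main__":
    for x in (0.1, 0.5, 1.0, 2.0, 5.0, 20.0):
        lhs = (3 - x) * math.exp(x)
        rhs = 0.5 * x * x + 2 * x + 3
        print(x, f(x), lhs <= rhs)
```

The values approach $1/3$ from below as $x$ tends to zero and decay roughly like $1/x$ for large $x$.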



Combination of these results yields the following natural regret bound, analogous to
Theorem 5 of Cesa-Bianchi et al. (2007).

Theorem 6 For losses in the range $[0,1]$, AdaHedge's regret is bounded by

$$\mathcal{R}^{\mathrm{ah}}_T \le 2\sqrt{V^{\mathrm{ah}}_T \ln K} + \tfrac43 \ln K + 2.$$

Proof Lemma 5 is of the form

$$(\Delta^{\mathrm{ah}}_T)^2 \le a + b\,\Delta^{\mathrm{ah}}_T, \tag{12}$$

with $a$ and $b$ nonnegative numbers. Solving for $\Delta^{\mathrm{ah}}_T$ then gives

$$\Delta^{\mathrm{ah}}_T \le \tfrac12 b + \tfrac12 \sqrt{b^2 + 4a} \le \tfrac12 b + \tfrac12 \big(\sqrt{b^2} + \sqrt{4a}\big) = \sqrt{a} + b,$$

which by Lemma 3 implies that

$$\mathcal{R}^{\mathrm{ah}}_T \le 2\sqrt{a} + 2b. \tag{13}$$

Plugging in the values $a = V^{\mathrm{ah}}_T \ln K$ and $b = \tfrac23 \ln K + 1$ from Lemma 5 completes the
proof. ■

This first regret bound for AdaHedge is difficult to interpret, because the cumulative loss
variance $V^{\mathrm{ah}}_T$ depends on the actions of the AdaHedge strategy itself (through the weights
$\boldsymbol{w}^{\mathrm{ah}}_t$). Below, we will derive a second regret bound for AdaHedge that depends only on
the data. However, AdaHedge has one important property that is captured by this first
result that is no longer expressed by the worst-case bound we will derive below. Namely,
if the data are easy in the sense that there is a clear best expert, say $k^*$, then the weights
played by AdaHedge will concentrate on that expert. If $w^{\mathrm{ah}}_{t,k^*} \to 1$ as $t$ increases, then
the loss variance must decrease: $v^{\mathrm{ah}}_t \to 0$. Thus, Theorem 6 suggests that the AdaHedge
regret may be bounded if the weights concentrate on the best expert sufficiently quickly.
This turns out to be the case: we can prove that the regret is indeed bounded for the
stochastic setting where the loss vectors $\boldsymbol{\ell}_t$ are independent, and $\mathbb{E}[L_{t,k} - L_{t,k^*}] = \Omega(t^\beta)$ for
all $k \ne k^*$ and any $\beta > 1/2$. This is an important feature of AdaHedge when it is used as
a stand-alone algorithm, and we provide a proof for the previous version of the strategy in
(Van Erven et al., 2011). See Section 5.4 for an example of concentration of the AdaHedge
weights. We will not pursue this further here because the Follow-the-Leader strategy also
incurs bounded loss in that case; we rather focus attention on how to successfully compete
with FTL in Section 3.

We now proceed to derive a bound that depends only on the data, using the same
approach as the one taken by Cesa-Bianchi et al. (2007). We first bound the cumulative
loss variance as follows:

Lemma 7 Suppose $H^{\mathrm{ah}}_T \ge L^*_T$. Then, for losses in the range $[0,1]$, the cumulative loss
variance for AdaHedge satisfies

$$V^{\mathrm{ah}}_T \le \frac{L^*_T (T - L^*_T)}{T} + 2\Delta^{\mathrm{ah}}_T.$$

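Proof The sum of variances is bounded by

$$V^{\mathrm{ah}}_T = \sum_{t=1}^{T} v^{\mathrm{ah}}_t \le \sum_{t=1}^{T} h^{\mathrm{ah}}_t \big(1 - h^{\mathrm{ah}}_t\big) \le \frac{H^{\mathrm{ah}}_T \big(T - H^{\mathrm{ah}}_T\big)}{T},$$

where the first inequality holds because a random variable with values in $[0,1]$ and mean $h$
has variance at most $h(1-h)$, and the second is Jensen's. Subsequently
using $H^{\mathrm{ah}}_T \ge L^*_T$ (by assumption) and $H^{\mathrm{ah}}_T \le L^*_T + 2\Delta^{\mathrm{ah}}_T$ (by Lemma 3) yields

$$V^{\mathrm{ah}}_T \le \frac{(L^*_T + 2\Delta^{\mathrm{ah}}_T)(T - L^*_T)}{T} \le \frac{L^*_T (T - L^*_T)}{T} + 2\Delta^{\mathrm{ah}}_T,$$

which was to be shown.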



This can be combined with Lemmas 5 and 3 to obtain the following bound, which
improves the dominant term of Corollary 3 of Cesa-Bianchi et al. (2007) by a factor of 2:






De Rooij, Van Erven, Grunwald and Koolen 



Theorem 8 For losses in the range $[0,1]$, AdaHedge's regret is bounded by

$$\mathcal{R}^{\mathrm{ah}}_T \le 2\sqrt{\frac{L^*_T (T - L^*_T)}{T} \ln K} + \tfrac{16}{3} \ln K + 2.$$

Proof If $H^{\mathrm{ah}}_T < L^*_T$, then $\mathcal{R}^{\mathrm{ah}}_T < 0$ and the result is clearly valid. But if $H^{\mathrm{ah}}_T \ge L^*_T$, we can
bound $V^{\mathrm{ah}}_T$ using Lemma 7 and plug the result into Lemma 5 to get an inequality of the
form (12) with $a = \frac{L^*_T (T - L^*_T)}{T} \ln K$ and $b = \tfrac83 \ln K + 1$. Following the steps of the proof
of Theorem 6 with these modified values for $a$ and $b$ we arrive at the desired result. ■



This is the best known bound for a Hedge algorithm where the regret is expressed in
terms of the loss rate $L^*_T/T$ of the best expert. Note that the bound is maximized for
$L^*_T = T/2$, in which case the dominant term reduces to $\sqrt{T \ln K}$. This matches the best
known result of the same form (Gerchinovitz, 2011), and improves upon the results of
(Cesa-Bianchi and Lugosi, 2006) by a factor $\sqrt{2}$. Alternatively, we can simplify our regret
bound using $(T - L^*_T)/T \le 1$ to obtain a dominant term of $2\sqrt{L^*_T \ln K}$. This also improves
the best known result (Auer et al., 2002) by a factor of $\sqrt{2}$. In both cases, our analysis is
more direct.

Note that the regret is small when the best expert either has a very low loss rate, or a
very high loss rate. The latter is important if the algorithm is to be used for the scenario
where we are provided with a sequence of bounded gain vectors $\boldsymbol{g}_t$ rather than losses: we
can translate the gains into losses using $\ell_{t,k} = 1 - g_{t,k}$, and then run AdaHedge. The bound
expresses that we incur small regret even if the best expert has a very small gain.

In the next section, we show how we can compete with FTL while maintaining these 
excellent guarantees up to a constant factor. 



3. FlipFlop 

AdaHedge balances the cumulative mixability gap $\Delta^{\mathrm{ah}}_T$ and the mix loss regret $M^{\mathrm{ah}}_T - L^*_T$
by reducing $\eta^{\mathrm{ah}}_t$ as necessary. But, as we observed previously, if the data are not hopelessly
adversarial we might not need to worry about the mixability gap: as Lemma 4 expresses,
$\delta^{\mathrm{ah}}_t$ is also small if the variance $v^{\mathrm{ah}}_t$ of the loss under the weights $\boldsymbol{w}^{\mathrm{ah}}_t$ is small, which is the
case if the weight on the best expert $\max_k w^{\mathrm{ah}}_{t,k}$ becomes close to one.

AdaHedge is able to exploit such a lucky scenario to an extent: as explained in the
discussion that follows Theorem 6, if the weight of the best expert goes to one quickly,
AdaHedge will have a small cumulative mixability gap, and therefore, by Lemma 3, a small
regret. This happens, for example, in the stochastic setting with independent, identically
distributed losses, when a single expert has the smallest expected loss. Similarly, in the
experiment of Section 5.4 the AdaHedge weights concentrate sufficiently quickly for the
regret to be bounded.

There is the potential for a nasty feedback loop, however. Suppose there are a small
number of difficult early trials, during which the cumulative mixability gap increases relatively
quickly. AdaHedge responds by reducing the learning rate (7), with the effect that
the weights on the experts become more uniform. As a consequence, the mixability gap in
future trials may be larger than what it would have been if the learning rate had stayed
high, leading to further unnecessary reductions of the learning rate, and so on. The end
result may be that AdaHedge behaves as if the data are difficult and incurs substantial
regret, even in cases where the regret of Hedge with a fixed high learning rate, or of
Follow-the-Leader, is bounded! Precisely this phenomenon occurs in the experiment in Section 5.2
below: AdaHedge's regret is close to the worst-case bound, whereas FTL hardly incurs any
regret at all.

It appears, then, that we must either hope that the data are easy enough that we can 
make the weights concentrate quickly on a single expert, by not reducing the learning rate 
at all; or we fear the worst and reduce the learning rate as much as we need to be able 
to provide good guarantees. We cannot really interpolate between these two extremes: an 
intermediate learning rate may not yield small regret in favourable cases and may at the 
same time destroy any performance guarantees in the worst case. 

It is unclear a priori whether we can get away with keeping the learning rate high, or whether
it is wiser to play it safe using AdaHedge. The most extreme case of keeping the learning
rate high is the limit as $\eta$ tends to $\infty$, for which Hedge reduces to Follow-the-Leader. In
this section we work out a strategy that combines the advantages of FTL and AdaHedge:
it retains AdaHedge's worst-case guarantees up to a constant factor, but its regret is also
bounded by a constant times the regret of FTL (Theorem 14). Perhaps surprisingly, this is
not easy to achieve. To see why, imagine a scenario where the average loss of the best expert
is substantial (say, about 0.5 per round), whereas the regret of either Follow-the-Leader or
AdaHedge is small. Since our combination has to guarantee a similarly small regret, it
has only a very limited margin for error. We cannot, for example, simply combine the 
two algorithms by recursively plugging them into Hedge with a fixed learning rate, or into 
AdaHedge: the performance guarantees we have for those methods of combination are too 
weak. Even if both FTL and AdaHedge yield small regret on the original problem, choosing 
the actions of FTL for some rounds and those of AdaHedge for the other rounds may fail, 
because the regret is not necessarily increasing, and we may end up picking each algorithm 
precisely in those rounds where the other one is better. 

These considerations motivate the FlipFlop strategy (superscript: "ff") described in this 
section, where we carefully alternate between the optimistic FTL strategy, and the worst- 
case-proof AdaHedge to get the best of both worlds. 

3.1 Exploiting Easy Data by Following the Leader 

We first investigate the potential benefits of FTL over AdaHedge. Lemma 9 below identifies
the circumstances under which FTL will perform well, which is when the number of leader
changes is small. It also shows that the regret for FTL is equal to the cumulative mixability
gap when FTL is interpreted as a Hedge strategy with infinite learning rate.

Lemma 9 Let $c_t$ be an indicator for a leader change at time $t$: define $c_t = 1$ if $t = 1$ or if
there exists an expert $k$ such that $L_{t-1,k} = L_{t-1}^*$ while $L_{t,k} \ne L_t^*$, and $c_t = 0$ otherwise. Let
$C_T = \sum_{t=1}^T c_t$ be the total number of leader changes up to time $T$. Then, for losses in the
range $[0,1]$, the FTL regret satisfies

$$\mathcal{R}_T^{\text{ftl}} = \Delta_T^{(\infty)} \le C_T.$$

Proof We have $M_T^{(\infty)} = L_T^*$ by the mix loss properties in Lemma 1, and consequently
$\mathcal{R}_T^{\text{ftl}} = \Delta_T^{(\infty)} + M_T^{(\infty)} - L_T^* = \Delta_T^{(\infty)}$.

To bound $\Delta_T^{(\infty)}$, notice that, for any $t$ such that $c_t = 0$, all leaders remained leaders and
incurred identical loss. It follows that $m_t^{(\infty)} = L_t^* - L_{t-1}^* = h_t^{(\infty)}$ and hence $\delta_t^{(\infty)} = 0$. By
bounding $\delta_t^{(\infty)} \le 1$ for all other $t$ we obtain

$$\Delta_T^{(\infty)} = \sum_{t=1}^T \delta_t^{(\infty)} = \sum_{t : c_t = 1} \delta_t^{(\infty)} \le \sum_{t : c_t = 1} 1 = C_T,$$

as required. ■

We see that the regret for FTL is bounded by the number of leader changes. This
is a natural measure of the difficulty of the problem, because it remains small whenever a
single expert makes the best predictions on average, even in the scenario described above, in
which AdaHedge gets caught in a feedback loop. One easy example where FTL outperforms
AdaHedge is when the losses are (1,0), (1,0), (0,1), (1,0), .... Then the FTL regret is at
most one, whereas AdaHedge's performance is close to the worst-case bound. This scenario
is discussed further in the experiments, Section 5.2.
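To make Lemma 9 concrete, here is a minimal Python sketch that runs FTL on a loss sequence and counts leader changes. It assumes that ties between leaders are split uniformly, which is what Hedge does in the limit of an infinite learning rate. The function name `ftl_regret_and_changes` and the two test sequences `easy` and `hard` are illustrative choices made here, not code from the paper.

```python
def ftl_regret_and_changes(losses):
    """Run Follow-the-Leader on `losses` (a list of per-round loss vectors).

    Returns (regret, C_T), where C_T counts the rounds with c_t = 1 as in
    Lemma 9: t = 1, or some expert that led after round t-1 no longer
    leads after round t.
    """
    K = len(losses[0])
    L = [0.0] * K      # cumulative losses of the experts
    total = 0.0        # cumulative loss of FTL
    changes = 0
    for t, l in enumerate(losses, start=1):
        best = min(L)
        leaders = [k for k in range(K) if L[k] == best]
        # Infinite learning rate: uniform weights on the current leaders.
        total += sum(l[k] for k in leaders) / len(leaders)
        L = [L[k] + l[k] for k in range(K)]
        new_best = min(L)
        lost_lead = any(L[k] != new_best for k in leaders)
        if t == 1 or lost_lead:
            changes += 1
    return total - min(L), changes


T = 1000
# (1,0), (1,0), (0,1), (1,0), ...: the second expert leads throughout.
easy = [(1, 0)] + [(1, 0) if t % 2 == 0 else (0, 1) for t in range(2, T + 1)]
# (1/2,0), (0,1), (1,0), ...: the current leader is hit in every round.
hard = [(0.5, 0)] + [(0, 1) if t % 2 == 0 else (1, 0) for t in range(2, T + 1)]

for name, data in (("easy", easy), ("hard", hard)):
    regret, changes = ftl_regret_and_changes(data)
    print(name, "regret:", regret, "leader changes:", changes)
```

On both sequences the computed regret stays below the number of leader changes, as the lemma requires; on the second sequence the leader changes in every round, so the bound is of little use there.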



3.2 FlipFlop 

In the following analysis we will assume, as before, that the losses satisfy $\ell_t \in [0, 1]^K$;
see Section 4 for discussion of the general case. FlipFlop is a Hedge strategy in the sense
that it uses exponential weights defined by (8), but the learning rate $\eta_t^{\text{ff}}$ now alternates
between infinity, such that the algorithm behaves like FTL, and the AdaHedge value, which
decreases as a function of the mixability gap accumulated over the rounds where AdaHedge
is used. In Definition 10 below, we will specify the "flip" regime $\overline{R}_t$, which is the subset of
times $\{1, \ldots, t\}$ where we follow the leader by using an infinite learning rate, and the "flop"
regime $\underline{R}_t = \{1, \ldots, t\} \setminus \overline{R}_t$, which is the set of times where the learning rate is determined
by AdaHedge (mnemonic: the position of the bar refers to the value of the learning rate).
We accumulate the mixability gap, the mix loss and the variance for these two regimes
separately:

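$$\overline{\Delta}_T = \sum_{t \in \overline{R}_T} \delta_t^{\text{ff}}; \qquad \overline{M}_T = \sum_{t \in \overline{R}_T} m_t^{\text{ff}}; \qquad \text{(flip)}$$

$$\underline{\Delta}_T = \sum_{t \in \underline{R}_T} \delta_t^{\text{ff}}; \qquad \underline{M}_T = \sum_{t \in \underline{R}_T} m_t^{\text{ff}}; \qquad \underline{V}_T = \sum_{t \in \underline{R}_T} v_t^{\text{ff}}. \qquad \text{(flop)}$$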

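We also change the learning rate from its definition for AdaHedge in (7) to the following,
which differentiates between the two regimes of the strategy:

$$\eta_t^{\text{ff}} = \begin{cases} \eta_t^{\text{flip}} & \text{if } t \in \overline{R}_t, \\ \eta_t^{\text{flop}} & \text{if } t \in \underline{R}_t, \end{cases} \qquad \text{where } \eta_t^{\text{flip}} = \eta_t^{\text{ftl}} = \infty \text{ and } \eta_t^{\text{flop}} = \frac{\ln K}{\underline{\Delta}_{t-1}}. \tag{14}$$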

Note that while the learning rates are defined separately for the two regimes, the exponential
weights (8) of the experts are still always determined using the cumulative losses $L_{t,k}$ over
all rounds. We also point out that, for rounds $t \in \underline{R}_T$, the learning rate $\eta_t^{\text{ff}} = \eta_t^{\text{flop}}$ is not
equal to $\eta_t^{\text{ah}}$, because it uses $\underline{\Delta}_{t-1}$ instead of $\Delta_{t-1}^{\text{ah}}$. For this reason, the FlipFlop regret may
be either better or worse than the AdaHedge regret; our results below only preserve the
regret bound up to a constant factor. In contrast, we do compete with the actual regret of
FTL.

It remains to define the "flip" regime $\overline{R}_t$ and the "flop" regime $\underline{R}_t$, which we will do by
specifying the times at which to switch from one to the other. FlipFlop starts optimistically,
with an epoch of the "flip" regime, which means it follows the leader, until $\overline{\Delta}_t$ becomes too
large compared to $\underline{\Delta}_t$. At that point it switches to an epoch of the "flop" regime, and keeps
using $\eta_t^{\text{flop}}$ until $\underline{\Delta}_t$ becomes too large compared to $\overline{\Delta}_t$. Then the process repeats with the
next epochs of the "flip" and "flop" regimes. The regimes are determined as follows:

Definition 10 (FlipFlop's Regimes) Let $\varphi > 1$ and $\alpha > 0$ be parameters of the
algorithm. Then

• FlipFlop starts in the "flip" regime.

• If $t$ is the earliest time since the start of a "flip" epoch where $\overline{\Delta}_t > (\varphi/\alpha)\underline{\Delta}_t$, then the
transition to the subsequent "flop" epoch occurs between rounds $t$ and $t + 1$. (Recall
that during "flip" epochs $\overline{\Delta}_t$ increases in $t$ whereas $\underline{\Delta}_t$ is constant.)

• Vice versa, if $t$ is the earliest time since the start of a "flop" epoch where $\underline{\Delta}_t > \alpha\overline{\Delta}_t$,
then the transition to the subsequent "flip" epoch occurs between rounds $t$ and $t + 1$.

This completes the definition of the FlipFlop strategy. See Figure 2 for a MATLAB
implementation.
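For readers who prefer Python, the following is a hedged sketch of the same strategy, following Definition 10 and the learning rate (14). It is a port written for this text, not the authors' code: the helper `mix` is an assumed stand-in for the mix-loss routine used by the MATLAB listing (uniform prior, with the infinite learning rate handled explicitly), the default parameters are the values suggested in Corollary 15, and the `regret` helper and the two test sequences are illustrative.

```python
import math


def mix(eta, L):
    """Hedge weights and cumulative mix loss for cumulative losses L
    (uniform prior). eta may be math.inf, which gives Follow-the-Leader."""
    best = min(L)
    if math.isinf(eta):
        w = [1.0 if x == best else 0.0 for x in L]
        M = best
    else:
        w = [math.exp(-eta * (x - best)) for x in L]
        M = best - math.log(sum(w) / len(L)) / eta
    s = sum(w)
    return [x / s for x in w], M


def flipflop(losses, alpha=1.243, phi=2.37):
    """Return the per-round losses h_t of FlipFlop on `losses`."""
    K = len(losses[0])
    L = [0.0] * K
    Delta = [0.0, 0.0]              # cumulative mixability gaps [flip, flop]
    scale = [phi / alpha, alpha]    # switching thresholds of Definition 10
    regime = 0                      # 0 = flip (FTL), 1 = flop
    h = []
    for l in losses:
        # Learning rate (14); ln(K)/0 is read as infinity.
        if regime == 0 or Delta[1] == 0.0:
            eta = math.inf
        else:
            eta = math.log(K) / Delta[1]
        w, M_prev = mix(eta, L)
        h_t = sum(wk * lk for wk, lk in zip(w, l))
        L = [Lk + lk for Lk, lk in zip(L, l)]
        _, M = mix(eta, L)
        delta = max(0.0, h_t - (M - M_prev))   # mixability gap of this round
        Delta[regime] += delta
        if Delta[regime] > scale[regime] * Delta[1 - regime]:
            regime = 1 - regime
        h.append(h_t)
    return h


def regret(h, losses):
    K = len(losses[0])
    return sum(h) - min(sum(l[k] for l in losses) for k in range(K))


T = 1000
easy = [(1, 0)] + [(1, 0) if t % 2 == 0 else (0, 1) for t in range(2, T + 1)]
hard = [(0.5, 0)] + [(0, 1) if t % 2 == 0 else (1, 0) for t in range(2, T + 1)]
print("easy:", regret(flipflop(easy), easy))
print("hard:", regret(flipflop(hard), hard))
```

Note that the weights are always computed from the cumulative losses over all rounds, while each of the two mixability gaps is accumulated only in its own regime.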

The analysis proceeds much like the analysis for AdaHedge. We first show that,
analogously to Lemma 3, the FlipFlop regret can be bounded in terms of the cumulative
mixability gap; in fact, we can use the smallest cumulative mixability gap that we encountered
in either of the two regimes, at the cost of slightly increased constant factors. This is the
fundamental building block in our FlipFlop analysis. We then proceed to develop analogues
of Lemmas 5 and 7, whose proofs do not have to be changed much to apply to FlipFlop.
Finally, all these results are combined to bound the regret of FlipFlop in Theorem 14, which
is the main result of this paper.

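Lemma 11 (FlipFlop version of Lemma 3) Suppose the losses take values in $[0,1]$.
Then the following two bounds hold simultaneously for the regret of the FlipFlop strategy
with parameters $\varphi > 1$ and $\alpha > 0$:

$$\mathcal{R}_T^{\text{ff}} \le \left(\frac{\varphi\alpha}{\varphi - 1} + 2\alpha + 1\right)\overline{\Delta}_T + \alpha\left(\frac{\varphi}{\varphi - 1} + 2\right); \tag{15}$$

$$\mathcal{R}_T^{\text{ff}} \le \left(\frac{\varphi}{\varphi - 1} + \frac{\varphi}{\alpha} + 2\right)\underline{\Delta}_T + \frac{\varphi}{\alpha}. \tag{16}$$
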

Proof The regret can be decomposed as

$$\mathcal{R}_T^{\text{ff}} = H_T^{\text{ff}} - L_T^* = \overline{\Delta}_T + \underline{\Delta}_T + \overline{M}_T + \underline{M}_T - L_T^*. \tag{17}$$

% Returns the losses of FlipFlop
% l(t,k) is the loss of expert k at time t; phi > 1 and alpha > 0 are parameters
function h = flipflop(l, alpha, phi)
    [T, K] = size(l);
    h      = nan(T,1);
    L      = zeros(1,K);
    Delta  = [0 0];
    scale  = [phi/alpha alpha];
    regime = 1; % 1=FTL, 2=AH

    for t = 1:T
        if regime==1, eta = Inf; else eta = log(K)/Delta(2); end
        [w, Mprev] = mix(eta, L);
        h(t) = w * l(t,:)';
        L = L + l(t,:);
        [~, M] = mix(eta, L);
        delta = max(0, h(t)-(M-Mprev));
        Delta(regime) = Delta(regime) + delta;
        if Delta(regime) > scale(regime) * Delta(3-regime)
            regime = 3-regime;
        end
    end
end

Figure 2: FlipFlop, with new ingredients in boldface 

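Our first step will be to bound the mix loss $\overline{M}_T + \underline{M}_T$ in terms of the mix loss $M_T^{\text{flop}}$ of
the auxiliary strategy that uses $\eta_t^{\text{flop}}$ for all $t$. As $\eta_t^{\text{flop}}$ is nonincreasing, we can then apply
Lemma 2 and mix loss property #3 to further bound

$$M_T^{\text{flop}} \le M_T^{(\eta_T^{\text{flop}})} \le L_T^* + \frac{\ln K}{\eta_T^{\text{flop}}} = L_T^* + \underline{\Delta}_{T-1} \le L_T^* + \underline{\Delta}_T. \tag{18}$$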

Let $0 = u_1 < u_2 < \cdots < u_b < T$ denote the times just before the epochs of the "flip"
regime begin, i.e. round $u_i + 1$ is the first round in the $i$-th "flip" epoch. Similarly let
$0 < v_1 < \cdots < v_b \le T$ denote the times just before the epochs of the "flop" regime begin,
where we artificially define $v_b = T$ if the algorithm is in the "flip" regime after $T$ rounds.
These definitions ensure that we always have $u_b < v_b \le T$. For the mix loss in the "flop"
regime we have

$$\underline{M}_T = (M_{u_2}^{\text{flop}} - M_{v_1}^{\text{flop}}) + (M_{u_3}^{\text{flop}} - M_{v_2}^{\text{flop}}) + \cdots + (M_{u_b}^{\text{flop}} - M_{v_{b-1}}^{\text{flop}}) + (M_T^{\text{flop}} - M_{v_b}^{\text{flop}}). \tag{19}$$

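Let us temporarily write $\eta_t = \eta_t^{\text{flop}}$ to avoid double superscripts. For the "flip" regime, the
properties in Lemma 1, together with the observation that $\eta_t^{\text{flop}}$ does not change during the
"flip" regime, give

$$\begin{aligned}
\overline{M}_T &= \sum_{i=1}^{b} \left(M_{v_i}^{(\infty)} - M_{u_i}^{(\infty)}\right) = \sum_{i=1}^{b} \left(L_{v_i}^* - L_{u_i}^*\right) \le \sum_{i=1}^{b} \left(M_{v_i}^{(\eta_{v_i})} - L_{u_i}^*\right) \\
&\le \sum_{i=1}^{b} \left(M_{v_i}^{(\eta_{v_i})} - M_{u_i}^{(\eta_{v_i})} + \frac{\ln K}{\eta_{v_i}}\right) = \sum_{i=1}^{b} \left(M_{v_i}^{\text{flop}} - M_{u_i}^{\text{flop}} + \frac{\ln K}{\eta_{u_i + 1}}\right) \\
&= (M_{v_1}^{\text{flop}} - M_{u_1}^{\text{flop}}) + (M_{v_2}^{\text{flop}} - M_{u_2}^{\text{flop}}) + \cdots + (M_{v_b}^{\text{flop}} - M_{u_b}^{\text{flop}}) + \sum_{i=1}^{b} \underline{\Delta}_{u_i}.
\end{aligned} \tag{20}$$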

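From the definition of the regime changes (Definition 10), we know the value of $\underline{\Delta}_{u_i}$ very
accurately at the time $u_i$ of a change from a "flop" to a "flip" regime:

$$\underline{\Delta}_{u_i} > \alpha\overline{\Delta}_{u_i} = \alpha\overline{\Delta}_{v_{i-1}} > \varphi\underline{\Delta}_{v_{i-1}} = \varphi\underline{\Delta}_{u_{i-1}}.$$

By unrolling from low to high $i$, we see that

$$\sum_{i=1}^{b} \underline{\Delta}_{u_i} \le \sum_{i=1}^{b} \varphi^{i-b}\,\underline{\Delta}_{u_b} \le \sum_{i=1}^{\infty} \varphi^{1-i}\,\underline{\Delta}_{u_b} = \frac{\varphi}{\varphi - 1}\underline{\Delta}_{u_b}.$$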

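Adding up (19) and (20), we therefore find that the total mix loss is bounded by

$$\overline{M}_T + \underline{M}_T \le M_T^{\text{flop}} + \sum_{i=1}^{b} \underline{\Delta}_{u_i} \le M_T^{\text{flop}} + \frac{\varphi}{\varphi - 1}\underline{\Delta}_{u_b} \le L_T^* + \left(\frac{\varphi}{\varphi - 1} + 1\right)\underline{\Delta}_T, \tag{21}$$

where the last inequality uses (18). Combination with (17) yields

$$\mathcal{R}_T^{\text{ff}} \le \left(\frac{\varphi}{\varphi - 1} + 2\right)\underline{\Delta}_T + \overline{\Delta}_T. \tag{22}$$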

Our next goal is to relate $\underline{\Delta}_T$ and $\overline{\Delta}_T$: by construction of the regimes, they are always
within a constant factor of each other. First, suppose that after $T$ trials we are in the $b$-th
epoch of the "flip" regime, that is, we will behave like FTL in round $T + 1$. In this state,
we know from Definition 10 that $\underline{\Delta}_T$ is stuck at the value that prompted the start of the
current epoch; this pinpoints its value up to one. At the same time, we know that $\overline{\Delta}_T$ is
large enough to have prompted the start of the $(b-1)$st "flop" epoch, but not large enough
to trigger the next regime change. From this we can deduce the following bounds:

$$(\underline{\Delta}_T - 1)/\alpha \le \overline{\Delta}_T \le \frac{\varphi}{\alpha}\underline{\Delta}_T.$$

On the other hand, if after $T$ rounds we are in the $b$-th epoch of the "flop" regime, then a
similar reasoning yields

$$\frac{\alpha}{\varphi}(\overline{\Delta}_T - 1) \le \underline{\Delta}_T \le \alpha\overline{\Delta}_T.$$

In both cases, it follows that

$$\underline{\Delta}_T \le \alpha\overline{\Delta}_T + \alpha; \qquad \overline{\Delta}_T \le \frac{\varphi}{\alpha}\underline{\Delta}_T + \frac{\varphi}{\alpha}.$$

The two bounds of the lemma are obtained by plugging first one, then the other of these
bounds into (22). ■



Lemma 12 (FlipFlop version of Lemma 5) Suppose the losses take values in $[0,1]$.
Then the cumulative mixability gap for the "flop" regime is bounded by the cumulative
variance of the losses for the "flop" regime:

$$(\underline{\Delta}_T)^2 \le \underline{V}_T \ln K + \left(1 + \tfrac{2}{3}\ln K\right)\underline{\Delta}_T.$$

Proof The proof is analogous to the proof of Lemma 5, with $\underline{\Delta}_T$ instead of $\Delta_T^{\text{ah}}$, $\underline{V}_T$ instead
of $V_T^{\text{ah}}$, and using $\eta_t = \eta_t^{\text{flop}} = \ln(K)/\underline{\Delta}_{t-1}$ instead of $\eta_t = \eta_t^{\text{ah}} = \ln(K)/\Delta_{t-1}^{\text{ah}}$. Furthermore,
we only need to sum over the rounds $\underline{R}_T$ in the "flop" regime, because $\underline{\Delta}_T$ does not change
during the "flip" regime. ■



We could use this result to prove an analogue of Theorem 6 for FlipFlop, but this would
be tedious; we therefore proceed directly to bound the variance in terms of the loss rate of
the best expert. The following lemma provides the equivalent of Lemma 7 for FlipFlop.
It can probably be strengthened to improve the lower order terms; we provide the version
that is easiest to prove.

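Lemma 13 (FlipFlop version of Lemma 7) Suppose $H_T^{\text{ff}} \ge L_T^*$. Then, for losses in
the range $[0, 1]$, the cumulative loss variance for FlipFlop with parameters $\varphi > 1$ and $\alpha > 0$
satisfies

$$\underline{V}_T \le \frac{L_T^*(T - L_T^*)}{T} + \left(\frac{\varphi}{\varphi - 1} + \frac{\varphi}{\alpha} + 2\right)\underline{\Delta}_T + \frac{\varphi}{\alpha}.$$

Proof The sum of variances satisfies

$$\underline{V}_T = \sum_{t \in \underline{R}_T} v_t^{\text{ff}} \le \sum_{t=1}^{T} v_t^{\text{ff}} \le \sum_{t=1}^{T} h_t^{\text{ff}}\left(1 - h_t^{\text{ff}}\right) \le T\left(\frac{H_T^{\text{ff}}}{T}\right)\left(1 - \frac{H_T^{\text{ff}}}{T}\right),$$

where the first inequality simply adds the variances for FTL rounds (which are often all
zero), the second is Lemma 4 and the third is Jensen's inequality. Subsequently using
$L_T^* \le H_T^{\text{ff}}$ (by assumption) and, from Lemma 11, $H_T^{\text{ff}} \le L_T^* + c$, where $c$ denotes the right
hand side of the bound (16), we find

$$\underline{V}_T \le \frac{(L_T^* + c)(T - L_T^*)}{T} \le \frac{L_T^*(T - L_T^*)}{T} + c,$$

which was to be shown.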



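Combining Lemmas 11, 12 and 13, we obtain our main result:

Theorem 14 (Main Theorem, FlipFlop version of Theorem 8) Suppose the losses
take values in $[0,1]$. Then the regret for FlipFlop with doubling parameters $\varphi > 1$ and
$\alpha > 0$ simultaneously satisfies the bounds

$$\mathcal{R}_T^{\text{ff}} \le \left(\frac{\varphi\alpha}{\varphi - 1} + 2\alpha + 1\right)\mathcal{R}_T^{\text{ftl}} + \alpha\left(\frac{\varphi}{\varphi - 1} + 2\right),$$

$$\mathcal{R}_T^{\text{ff}} \le c_1\sqrt{\frac{L_T^*(T - L_T^*)}{T}\ln K} + c_1\left(c_1 + \tfrac{2}{3}\right)\ln K + c_1\sqrt{c_2 \ln K} + c_1 + c_2,$$

where $c_1 = \frac{\varphi}{\varphi - 1} + \frac{\varphi}{\alpha} + 2$ and $c_2 = \frac{\varphi}{\alpha}$.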

This shows that, up to a multiplicative factor in the regret, FlipFlop is always as good as 
the best of Follow-the-Leader and AdaHedge's bound. Of course, if AdaHedge significantly 
outperforms its bound, it is not guaranteed that FlipFlop will outperform the bound in the 
same way. 

In the experiments in Section 5 we demonstrate that the multiplicative factor is not just
an artifact of the bounds, but can actually be observed on simulated data.
Proof From Lemma 9 we know that $\overline{\Delta}_T \le \Delta_T^{(\infty)} = \mathcal{R}_T^{\text{ftl}}$. Substitution in (15) of Lemma 11
yields the first inequality.

For the second inequality, note that $L_T^* > H_T^{\text{ff}}$ means the regret is negative, in which
case the result is clearly valid. We may therefore assume w.l.o.g. that $L_T^* \le H_T^{\text{ff}}$ and apply
Lemma 13. Combination with Lemma 12 yields

$$(\underline{\Delta}_T)^2 \le \underline{V}_T \ln K + \left(1 + \tfrac{2}{3}\ln K\right)\underline{\Delta}_T \le \frac{L_T^*(T - L_T^*)}{T}\ln K + c_2 \ln K + c_3\,\underline{\Delta}_T,$$

where $c_3 = 1 + (c_1 + \tfrac{2}{3})\ln K$. We now solve this quadratic inequality as in (12) and relax it
using $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ for nonnegative numbers $a, b$ to obtain

$$\underline{\Delta}_T \le \sqrt{\frac{L_T^*(T - L_T^*)}{T}\ln K + c_2 \ln K} + c_3 \le \sqrt{\frac{L_T^*(T - L_T^*)}{T}\ln K} + \left(c_1 + \tfrac{2}{3}\right)\ln K + \sqrt{c_2 \ln K} + 1.$$

In combination with Lemma 11, this yields the second bound of the theorem.
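The step from the quadratic inequality to the bound on the cumulative mixability gap can be spelled out as follows. This is only a sketch of the standard argument: the shorthand $x$, $A$ and $b$ is introduced here for readability and does not appear in the paper.

```latex
% Shorthand (introduced for this sketch only):
%   x = \underline{\Delta}_T \ge 0, \qquad b = c_3 \ge 0,
%   A = \frac{L_T^*(T - L_T^*)}{T}\ln K + c_2 \ln K \ge 0.
x^2 \le A + b\,x
\;\Longrightarrow\;
x \le \frac{b}{2} + \sqrt{\frac{b^2}{4} + A}
  \le \frac{b}{2} + \frac{b}{2} + \sqrt{A}
  = \sqrt{A} + b,
\qquad
\sqrt{A} \le \sqrt{\frac{L_T^*(T - L_T^*)}{T}\ln K} + \sqrt{c_2 \ln K}.
```

Substituting $b = c_3 = 1 + (c_1 + \tfrac{2}{3})\ln K$ then gives the terms $(c_1 + \tfrac{2}{3})\ln K + 1$ in the final bound.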



Finally, we propose to select the parameter values that minimize the constant factor in
front of the leading terms of these regret bounds.

Corollary 15 The parameter values $\varphi^* = 2.37$ and $\alpha^* = 1.243$ approximately minimize
the worst of the two leading factors in the bounds of Theorem 14. The regret for FlipFlop
with these parameters is simultaneously bounded by

$$\mathcal{R}_T^{\text{ff}} \le 5.64\,\mathcal{R}_T^{\text{ftl}} + 4.64,$$

$$\mathcal{R}_T^{\text{ff}} \le 5.64\sqrt{\frac{L_T^*(T - L_T^*)}{T}\ln K} + 35.53\ln K + 7.78\sqrt{\ln K} + 7.54.$$

Proof The leading factors $f(\varphi, \alpha) = \frac{\varphi\alpha}{\varphi - 1} + 2\alpha + 1$ and $g(\varphi, \alpha) = \frac{\varphi}{\varphi - 1} + \frac{\varphi}{\alpha} + 2$ are
respectively increasing and decreasing in $\alpha$. They are equalized for
$\alpha(\varphi) = \left(2\varphi - 1 + \sqrt{12\varphi^3 - 16\varphi^2 + 4\varphi + 1}\right)/(6\varphi - 4)$.
The analytic solution for the minimum of $f(\varphi, \alpha(\varphi))$ in
$\varphi$ is too long to reproduce here, but it is approximately equal to $\varphi^* = 2.37$, at which point
$\alpha(\varphi^*) \approx 1.243$. ■
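The numbers in Corollary 15 can be checked numerically. The sketch below evaluates the two leading factors, equalizes them with the closed form for $\alpha(\varphi)$, and minimizes over $\varphi$ by a coarse grid search. The grid range and step size are arbitrary choices made for this illustration, and the function names are not from the paper.

```python
import math


def alpha_of_phi(phi):
    """The alpha that equalizes the two leading factors f and g."""
    disc = 12 * phi**3 - 16 * phi**2 + 4 * phi + 1
    return (2 * phi - 1 + math.sqrt(disc)) / (6 * phi - 4)


def f(phi, alpha):
    """Leading factor of the FTL-based bound."""
    return phi * alpha / (phi - 1) + 2 * alpha + 1


def g(phi, alpha):
    """Leading factor c_1 of the worst-case bound."""
    return phi / (phi - 1) + phi / alpha + 2


# Coarse grid search over phi in (1, 10); f(phi, alpha(phi)) is quite flat
# near its minimum, so only about two decimals of phi are meaningful.
grid = [1.001 + 0.001 * i for i in range(9000)]
phi_star = min(grid, key=lambda p: f(p, alpha_of_phi(p)))
alpha_star = alpha_of_phi(phi_star)
print("phi* ~", round(phi_star, 2), " alpha* ~", round(alpha_star, 3))

# Constants of Corollary 15 at the rounded parameter values.
phi, alpha = 2.37, 1.243
c1 = g(phi, alpha)
c2 = phi / alpha
constants = [
    f(phi, alpha),                    # factor in front of the FTL regret
    alpha * (phi / (phi - 1) + 2),    # additive term of the first bound
    c1 * (c1 + 2 / 3),                # coefficient of ln K
    c1 * math.sqrt(c2),               # coefficient of sqrt(ln K)
    c1 + c2,                          # additive constant of the second bound
]
print([round(c, 2) for c in constants])
```

The grid minimum lands close to the values stated in the corollary; the exact third decimal of the minimizer depends on the grid resolution.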



4. Invariance to Rescaling and Translation 

In the previous two sections, we have assumed, for simplicity, that the losses $\ell_{t,k}$ were
translated and normalised to take values in the interval $[0, 1]$. Although this is a common
assumption in the literature, it requires a priori knowledge of the range of the losses. One
would therefore prefer algorithms that do not require the losses to be normalised. As
discussed by Cesa-Bianchi et al. (2007), the regret bounds for such algorithms should not
change when losses are translated (because this does not change the regret) and should scale
by $\sigma$ when the losses are scaled by a factor $\sigma > 0$ (because the regret scales by $\sigma$). They
call such regret bounds fundamental and show that most of the methods they introduce
satisfy such fundamental bounds.

Here we go even further: it is not just our bounds that are fundamental, but also our 
algorithms, which do not change their output weights if the losses are scaled or translated. 

Theorem 16 Both AdaHedge and FlipFlop are invariant to translation and rescaling of
the losses. Starting with losses $\ell_1, \ldots, \ell_T$, obtain rescaled, translated losses $\ell'_1, \ldots, \ell'_T$ by
picking any $\sigma > 0$ and arbitrary reals $\tau_1, \ldots, \tau_T$, and setting $\ell'_{t,k} = \sigma\ell_{t,k} + \tau_t$ for $t = 1, \ldots, T$
and $k = 1, \ldots, K$. Both AdaHedge and FlipFlop issue the exact same sequence of weights
$w'_t = w_t$ on $\ell'_t$ as they do on $\ell_t$.

Proof We annotate any quantity with a prime to denote that it is defined with respect to
the data set $\ell'_t$. We omit the algorithm name from the superscript. First consider AdaHedge.
We will prove the following relations by induction on $t$:

$$\Delta'_{t-1} = \sigma\Delta_{t-1}; \qquad \eta'_t = \frac{\eta_t}{\sigma}; \qquad w'_t = w_t. \tag{23}$$

For $t = 1$, these are valid since $\Delta'_0 = \sigma\Delta_0 = 0$, $\eta'_1 = \eta_1/\sigma = \infty$, and $w'_1 = w_1$ are
uniform. Now assume towards induction that (23) is valid for some $t \in \{1, \ldots, T\}$. We
can then compute the following values from their definition: $h'_t = w'_t \cdot \ell'_t = \sigma h_t + \tau_t$;
$m'_t = -(1/\eta'_t)\ln(w'_t \cdot e^{-\eta'_t \ell'_t}) = \sigma m_t + \tau_t$; $\delta'_t = h'_t - m'_t = \sigma(h_t - m_t) = \sigma\delta_t$. Thus, the
mixability gaps are also related by the scale factor $\sigma$. From there we can reestablish the
induction hypothesis for the next round: we have $\Delta'_t = \Delta'_{t-1} + \delta'_t = \sigma\Delta_{t-1} + \sigma\delta_t = \sigma\Delta_t$,
and $\eta'_{t+1} = \ln(K)/\Delta'_t = \eta_{t+1}/\sigma$. For the weights we get
$w'_{t+1} \propto e^{-\eta'_{t+1} L'_t} \propto e^{-(\eta_{t+1}/\sigma)(\sigma L_t)} \propto w_{t+1}$,
which means the two must be equal since both sum to one. Thus the relations of (23)
are also valid for time $t + 1$, proving the result for AdaHedge.

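For FlipFlop, if we assume regime changes occur at the same times for $\ell'$ and $\ell$, then
similar reasoning reveals $\overline{\Delta}'_t = \sigma\overline{\Delta}_t$; $\underline{\Delta}'_t = \sigma\underline{\Delta}_t$, $\eta_t'^{\,\text{flip}} = \eta_t^{\text{flip}}/\sigma = \infty$, $\eta_t'^{\,\text{flop}} = \eta_t^{\text{flop}}/\sigma$, and
$w'_t = w_t$. It remains to check that the regime changes do indeed occur at the same times.
Note that in Definition 10 the "flop" regime is started when $\overline{\Delta}'_t > (\varphi/\alpha)\underline{\Delta}'_t$, which is
equivalent to testing $\overline{\Delta}_t > (\varphi/\alpha)\underline{\Delta}_t$ since both sides of the inequality are scaled by $\sigma$. Similarly,
the "flip" regime starts when $\underline{\Delta}'_t > \alpha\overline{\Delta}'_t$, which is equivalent to the test $\underline{\Delta}_t > \alpha\overline{\Delta}_t$. ■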
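The invariance can also be observed numerically. The sketch below implements AdaHedge with the learning rate $\eta_t = \ln(K)/\Delta_{t-1}$ used in this section and compares the weights it plays on a small artificial data set with the weights on a rescaled and translated copy. The data, the scale factor and the translations are arbitrary values chosen for this illustration, and `mix` is an assumed helper for the exponential weights and the cumulative mix loss under a uniform prior.

```python
import math


def mix(eta, L):
    """Hedge weights and cumulative mix loss for cumulative losses L."""
    best = min(L)
    if math.isinf(eta):
        w = [1.0 if x == best else 0.0 for x in L]
        M = best
    else:
        w = [math.exp(-eta * (x - best)) for x in L]
        M = best - math.log(sum(w) / len(L)) / eta
    s = sum(w)
    return [x / s for x in w], M


def adahedge_weights(losses):
    """Sequence of weight vectors played by AdaHedge on `losses`."""
    K = len(losses[0])
    L = [0.0] * K
    Delta = 0.0                    # cumulative mixability gap
    played = []
    for l in losses:
        eta = math.log(K) / Delta if Delta > 0 else math.inf
        w, M_prev = mix(eta, L)
        played.append(w)
        h = sum(wk * lk for wk, lk in zip(w, l))
        L = [Lk + lk for Lk, lk in zip(L, l)]
        _, M = mix(eta, L)
        Delta += max(0.0, h - (M - M_prev))
    return played


T, K = 30, 3
# Deterministic toy losses in {0, 1/2, 1} (illustrative only).
losses = [[((t * (k + 2) + k * k) % 3) / 2 for k in range(K)]
          for t in range(1, T + 1)]
sigma = 4.0
tau = [(t % 5) - 2 for t in range(1, T + 1)]
scaled = [[sigma * x + tau_t for x in l] for l, tau_t in zip(losses, tau)]

w_orig = adahedge_weights(losses)
w_scaled = adahedge_weights(scaled)
max_diff = max(abs(a - b)
               for wa, wb in zip(w_orig, w_scaled)
               for a, b in zip(wa, wb))
print("largest weight difference:", max_diff)
```

In floating-point arithmetic the two weight sequences agree only up to rounding error, so the comparison uses a tolerance rather than exact equality.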

Making our bounds fundamental is a simple corollary of Theorem 16. For AdaHedge the
result is a slight improvement of the bound for the CBMS algorithm by Cesa-Bianchi et al.
(2007).



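Corollary 17 Fix arbitrary losses $\ell_1, \ldots, \ell_T$ in $\mathbb{R}^K$, and let

$$\mu_t = \min_k \ell_{t,k}, \qquad \sigma = \max_{t \in \{1, \ldots, T\}} \max_k \,(\ell_{t,k} - \mu_t)$$

be the minimal loss in round $t$ and the scale of the losses, respectively. Then, without
modification, AdaHedge and FlipFlop satisfy the regret bounds

$$\mathcal{R}_T^{\text{ah}} \le 2\sqrt{\frac{N_T^*(\sigma T - N_T^*)}{T}\ln K} + \sigma\left(\tfrac{16}{3}\ln K + 2\right)$$

and

$$\mathcal{R}_T^{\text{ff}} \le c_1\sqrt{\frac{N_T^*(\sigma T - N_T^*)}{T}\ln K} + \sigma\left(c_1\left(c_1 + \tfrac{2}{3}\right)\ln K + c_1\sqrt{c_2 \ln K} + c_1 + c_2\right),$$

where $N_T^* = L_T^* - \sum_{t=1}^{T} \mu_t$ is the optimally translated loss of the best expert, and $c_1$ and $c_2$
are the same constants as in Theorem 14.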

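Proof Define the normalised losses $\ell'_{t,k} = (\ell_{t,k} - \mu_t)/\sigma$, and let $\mathcal{R}_T^{\text{ah}\prime}$, $\mathcal{R}_T^{\text{ff}\prime}$ and $\mathcal{R}_T^{\text{ftl}\prime}$
respectively denote the regret of AdaHedge, FlipFlop and Follow-the-Leader when run on
these losses. Also let $L_T'^{*} = (L_T^* - \sum_{t=1}^{T} \mu_t)/\sigma$ denote the corresponding loss of the best
expert. Then we have $N_T^* = \sigma L_T'^{*}$ and by Theorem 16 also $\mathcal{R}_T^{\text{ah}} = \sigma\mathcal{R}_T^{\text{ah}\prime}$, $\mathcal{R}_T^{\text{ff}} = \sigma\mathcal{R}_T^{\text{ff}\prime}$ and
$\mathcal{R}_T^{\text{ftl}} = \sigma\mathcal{R}_T^{\text{ftl}\prime}$. The corollary follows by plugging these identities into the bounds obtained
by applying Theorems 8 and 14 to the normalised losses $\ell'_1, \ldots, \ell'_T$. ■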



5. Experiments 

We performed four experiments on artificial data, designed to clarify how the learning rate
determines performance in a variety of Hedge algorithms. We have kept the experiments
as simple as possible: the data are deterministic, and involve two experts. In each case, the
data consist of one initial hand-crafted loss vector, followed by a sequence of 999 loss vectors
which are either (0 1) or (1 0). The data are generated by sequentially appending the loss
vector that brings the cumulative loss difference $L_{t,1} - L_{t,2}$ closer to a target $f_\xi(t)$, where
$\xi \in \{1, 2, 3, 4\}$ indexes a particular experiment. Each $f_\xi : [0, \infty) \to [0, \infty)$ is a nondecreasing
function with $f_\xi(0) = 0$; intuitively, it expresses how much better expert 2 is than expert 1
as a function of time. The functions change slowly enough that our construction has the
property $|L_{t,1} - L_{t,2} - f_\xi(t)| \le 1$ for all $t$.
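The construction described above can be sketched in a few lines of Python. The greedy rule below is one reading of "appending the loss vector that brings the cumulative loss difference closer to the target"; the function name `generate` and the tie-breaking choice (ties go to the vector (0, 1)) are assumptions of this sketch, and the targets are the ones stated in Sections 5.1 to 5.4.

```python
def generate(first, f, T=1000):
    """Greedy data construction: after the hand-crafted vector `first`,
    append (1, 0) or (0, 1), whichever brings the cumulative loss
    difference L_{t,1} - L_{t,2} closer to the target f(t)."""
    losses = [tuple(first)]
    d = first[0] - first[1]           # current value of L_{t,1} - L_{t,2}
    for t in range(2, T + 1):
        if abs(d + 1 - f(t)) < abs(d - 1 - f(t)):
            losses.append((1, 0))
            d += 1
        else:
            losses.append((0, 1))
            d -= 1
    return losses


def totals(losses):
    """Cumulative loss of each of the two experts."""
    return sum(l[0] for l in losses), sum(l[1] for l in losses)


exp1 = generate((0.5, 0), lambda t: 0.0)
exp2 = generate((1, 0), lambda t: 1.5)
exp3 = generate((1, 0), lambda t: t ** 0.4)
exp4 = generate((1, 0), lambda t: t ** 0.6)

print(exp1[:5])
print(exp2[:6])
print(totals(exp3), totals(exp4))
```

With this rule the generated sequences start with the loss vectors shown for Experiments 1 and 2, and the final cumulative losses match the totals reported for Experiments 3 and 4.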

For each experiment, we first plot $\mathcal{R}_T^{(\eta)}$, the regret of the Hedge algorithm as a function
of the fixed learning rate $\eta$. We subsequently plot the regret $\mathcal{R}_t^{\text{alg}}$ as a function of the time
$t = 1, \ldots, T = 1000$, for each of the following algorithms "alg":

1. Follow-the-Leader (Hedge with learning rate $\infty$),

2. Hedge with fixed learning rate $\eta = 1$,

3. Hedge with the learning rate that optimizes the worst-case bound ($\eta = \sqrt{8\ln(K)/T} \approx 0.0745$);
we will call this algorithm "safe Hedge" for brevity,

4. AdaHedge,

5. FlipFlop,

6. Hazan and Kale's (2008) algorithm, using the fixed learning rate that optimises the
bound provided in their paper,

7. NormalHedge, described by Chaudhuri et al. (2009).



Note that the safe Hedge strategy (the third item above) can only be used in practice if the
horizon $T$ is known in advance. Hazan and Kale's algorithm (the sixth item) additionally
requires precognition of the losses incurred by the various actions up until $T$. In practice
these algorithms would have to be used in conjunction with the doubling trick, which would
result in substantially worse, and harder to interpret, results.

We include algorithms 6 and 7 because, as we explained in Section 1.2, they are the state
of the art in Hedge-style algorithms. To reduce clutter, we omit results for the algorithm
described in Cesa-Bianchi et al. (2007): its behaviour is very similar to that of AdaHedge.



Below we provide an exact description of each experiment, and discuss the results. 



5.1 Experiment 1. Worst case for FTL 

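The experiment is defined by $\ell_1 = (\tfrac{1}{2}\ 0)$, and $f_1(t) = 0$. This yields a loss matrix $\ell$ that
starts as follows:

$$\begin{pmatrix} 1/2 & 0 & 1 & 0 & 1 & \cdots \\ 0 & 1 & 0 & 1 & 0 & \cdots \end{pmatrix}$$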

These data are the worst case for FTL: each round, the leader incurs loss one, while each of
the two individual experts only receives a loss once every two rounds. Thus, the FTL regret
increases by one every two rounds and ends up around 500. For any learning rate $\eta$, the
weights used by the Hedge algorithm are repeated every two rounds, so the regret $H_t - L_t^*$
increases by the same amount every two rounds: the regret increases linearly in $t$ for every
fixed $\eta$ that does not vary with $t$. However, the constant of proportionality can be reduced
greatly by reducing the value of $\eta$, as the top graph in Figure 3 shows: for $T = 1000$,
the regret becomes negligible for any $\eta$ less than about 0.01. Thus, in this experiment, a
learning algorithm must reduce the learning rate to shield itself from incurring an excessive
overhead.
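The dependence of the regret on a fixed learning rate can be reproduced with a short Python sketch of Hedge with uniform prior, run on the data of this experiment. The function name `hedge_regret` and the particular learning rates in the loop are choices made for this illustration; the "safe" value is computed from the formula $\sqrt{8\ln(K)/T}$ quoted in the list of algorithms.

```python
import math


def hedge_regret(losses, eta):
    """Regret of Hedge with fixed learning rate eta (math.inf gives FTL)."""
    K = len(losses[0])
    L = [0.0] * K
    total = 0.0
    for l in losses:
        best = min(L)
        if math.isinf(eta):
            w = [1.0 if x == best else 0.0 for x in L]
        else:
            w = [math.exp(-eta * (x - best)) for x in L]
        s = sum(w)
        total += sum(wk * lk for wk, lk in zip(w, l)) / s
        L = [Lk + lk for Lk, lk in zip(L, l)]
    return total - min(L)


T, K = 1000, 2
# Experiment 1: (1/2, 0) followed by (0,1), (1,0), (0,1), ...
data = [(0.5, 0)] + [(0, 1) if t % 2 == 0 else (1, 0) for t in range(2, T + 1)]
eta_safe = math.sqrt(8 * math.log(K) / T)

for eta in (math.inf, 1.0, eta_safe, 0.01):
    print("eta =", eta, " regret =", round(hedge_regret(data, eta), 2))
```

The regret shrinks steadily as the learning rate decreases, in line with the description of the top graph in Figure 3.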






The bottom graph in Figure 3 shows the expected breakdown of the FTL algorithm;
Hedge with fixed learning rate $\eta = 1$ also performs quite badly. When $\eta$ is reduced to the
value that optimises the worst-case bound, the regret becomes competitive with that of the
other algorithms. Note that Hazan and Kale's algorithm has the best performance; this is
because its learning rate is tuned in relation to the bound proved in the paper, which has
a relatively large constant in front of the leading term. As a consequence the algorithm
always uses a relatively small learning rate, which turns out to be helpful in this case but
harmful in later experiments.

The FlipFlop algorithm behaves as theory suggests it should: its regret increases
alternately like the regret of AdaHedge and the regret of FTL. The latter performs horribly, so
during those intervals the regret increases quickly; on the other hand, the FTL intervals are
relatively short-lived, so they do not harm the regret by more than a constant factor.

The NormalHedge algorithm still has acceptable performance, although its regret is relatively
large in this experiment; we have no explanation for this, but in fairness we do observe good
performance of NormalHedge in the other three experiments as well as in numerous further
unreported simulations.

5.2 Experiment 2. Best case for FTL 

The second experiment is defined by $\ell_1 = (1\ 0)$, and $f_2(t) = 3/2$. The induced loss matrix
$\ell$ starts as follows:

$$\begin{pmatrix} 1 & 1 & 0 & 1 & 0 & 1 & \cdots \\ 0 & 0 & 1 & 0 & 1 & 0 & \cdots \end{pmatrix}$$

These data look very similar to the first experiment, but as the top graph in Figure 4
illustrates, because of this small change, it is now viable to reduce the regret by using a
very large learning rate. In particular, since there are no leader changes after the first
round, FTL incurs a regret of only 1/2.

As in the first experiment, the regret increases linearly in $t$ for every fixed $\eta$ (provided
it is less than $\infty$); but now the constant of linearity is large only for learning rates close
to 1. Once FlipFlop enters the FTL regime for the second time, it stays there indefinitely,
which results in bounded regret. We observe that NormalHedge adapts in the same way to
these data. The behaviour of the other algorithms is very similar to the first experiment,
and as a consequence their regret grows without bound.

5.3 Experiment 3. Weights do not concentrate in AdaHedge 

The third experiment uses $\ell_1 = (1\ 0)$, and $f_3(t) = t^{0.4}$. The first few loss vectors are the
same as in the previous experiment, but every now and then there are two loss vectors (1 0)
in a row, so that the first expert gradually falls behind the second in terms of performance.
By $t = T = 1000$, the first expert has accumulated 508 loss, while the second expert has
only 492.

For any fixed learning rate $\eta$, the weights used by Hedge now concentrate on the second
expert. We know from Lemma 4 that the mixability gap in any round $t$ is bounded by a
constant times the variance of the loss under the weights played by the algorithm; as these
weights concentrate on the second expert, this variance must go to zero. One can show that
this happens quickly enough for the cumulative mixability gap to be bounded for any fixed
$\eta$ that does not vary with $t$ or depend on $T$. From the decomposition of the regret into the
mix loss regret and the cumulative mixability gap we have

$$\mathcal{R}_T^{(\eta)} = M_T - L_T^* + \Delta_T \le \frac{\ln K}{\eta} + \text{bounded} = \text{bounded}.$$

So in this scenario, as long as the learning rate is kept fixed, we will eventually learn the 
identity of the best expert. However, if the learning rate is very small, this will happen so 
slowly that the weights still have not converged by t = 1000. Even worse, the top graph 
in Figure [5] shows that for intermediate values of the learning rate, not only do the weights 
fail to converge on the second expert sufficiently quickly, but they are sensitive enough to 
increase the overhead incurred each round. 

For this experiment, it really pays to use a large learning rate rather than a safe small
one. Thus FTL, Hedge with $\eta = 1$, FlipFlop and NormalHedge perform excellently, while
safe Hedge, AdaHedge and Hazan and Kale's algorithm incur a substantial overhead.
Extrapolating the trend in the graph, it appears that the overhead of these algorithms is
not bounded. This is possible because the three algorithms with poor performance use a
learning rate that decreases as a function of $t$. As a consequence, the learning rate used
may remain too small for the weights to concentrate. For the case of AdaHedge, this is an
example of the "nasty feedback loop" described in Section 3.

5.4 Experiment 4. Weights do concentrate in AdaHedge 

The fourth and last experiment uses $\ell_1 = (1\ 0)$, and $f_4(t) = t^{0.6}$. The losses are comparable
to those of the third experiment, but the performance gap between the two experts is
somewhat larger. By $t = T = 1000$, the two experts have loss 532 and 468, respectively. It
is now so easy to determine which of the experts is better that the top graph in Figure 6 is
nonincreasing: the larger the learning rate, the better.

The algorithms that managed to keep their regret bounded in the previous experiment
obviously still perform very well, but it is clearly visible that AdaHedge now achieves the
same. As discussed below Theorem 6, this happens because the weight concentrates on the
second expert quickly enough that AdaHedge's regret is bounded in this setting. Thus, while
the previous experiment shows that AdaHedge can be tricked into reducing the learning rate
while it would be better not to do so, the present experiment shows that on the other hand,
sometimes AdaHedge does adapt really nicely to easy data, in contrast to algorithms that
are tuned in terms of a worst-case bound.

6. Discussion and Conclusion

The main contributions of this work are twofold. First, we develop a new hedging algorithm
called AdaHedge. The analysis simplifies existing results and we obtain improved bounds
(Theorems 6 and 8). Moreover, AdaHedge is the first sophisticated Hedge algorithm that
is "fundamental", i.e. its weights are invariant under translation and scaling of the losses
(Section 4). Second, we explain in detail why it is difficult to tune the learning rate such
that good performance is obtained both for easy and for hard data, and we address the
issue by developing the FlipFlop algorithm. FlipFlop never performs much worse than the
Follow-the-Leader strategy, which works very well on easy data (Lemma 9), but it also
retains a worst-case bound similar to the bound for AdaHedge (Theorem 14). As such, this
work may be seen as solving a special case of a more general question. Below we briefly
address this question and then place this work in a broader context, which provides an
ambitious agenda for future work.



6.1 General Question: Competing with Hedge for any fixed learning rate 

FlipFlop has regret to within a multiplicative constant of Hedge with learning rate $\infty$ (FTL)
and Hedge with a variable, nonincreasing learning rate which achieves optimal regret in the
worst case. It is now natural to ask whether we can design a "Universal Hedge" algorithm
that can compete with Hedge with any fixed learning rate $0 < \eta < \infty$. That is, for all $T$,
the regret up to time $T$ of Universal Hedge should be within a constant factor $C$ of the
regret incurred by Hedge run with the fixed $\eta$ that minimizes the Hedge loss $H_T^{(\eta)}$. This
appears to be a difficult question, and maybe such an algorithm does not even exist. Yet
even partial results (such as an algorithm that competes with $\eta \in [\sqrt{\ln(K)/T}, \infty]$ or with
a factor $C$ that increases slowly, say, logarithmically, in $T$) would already be of significant
interest.
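
To make the benchmark in this question concrete, the following Python sketch computes the regret of Hedge for a few fixed learning rates, including $\eta = \infty$ (FTL with uniform weights on the current leaders), and selects the best one in hindsight. It is only an illustration under stated assumptions: the function names, the grid of learning rates and the two loss sequences are hypothetical and do not come from the experiments in this paper.

```python
import math

def hedge_regret(losses, eta):
    """Regret of Hedge with a fixed learning rate; eta = math.inf gives FTL."""
    num_experts = len(losses[0])
    cum = [0.0] * num_experts   # cumulative loss L_{t,k} of each expert
    hedge_loss = 0.0            # cumulative Hedge loss H_t
    for loss in losses:
        best = min(cum)
        if math.isinf(eta):
            # FTL: uniform weights on the current leaders.
            w = [1.0 if c == best else 0.0 for c in cum]
        else:
            # Exponential weights, shifted by the best loss for stability.
            w = [math.exp(-eta * (c - best)) for c in cum]
        z = sum(w)
        hedge_loss += sum(wk * lk for wk, lk in zip(w, loss)) / z
        cum = [c + lk for c, lk in zip(cum, loss)]
    return hedge_loss - min(cum)

def best_fixed_eta(losses, grid):
    """Learning rate in `grid` with the smallest regret in hindsight."""
    return min(grid, key=lambda eta: hedge_regret(losses, eta))

# Hypothetical data: one easy sequence and one that is hard for FTL.
easy = [(0.0, 1.0)] * 100
hard = [(0.5, 0.0)] + [(0.0, 1.0), (1.0, 0.0)] * 50
grid = [0.01, 0.1, 1.0, 10.0, math.inf]

for name, data in [("easy", easy), ("hard", hard)]:
    eta_star = best_fixed_eta(data, grid)
    print(name, eta_star, round(hedge_regret(data, eta_star), 3))
```

On the easy sequence the hindsight-best rate in this grid is $\eta = \infty$, whereas on the hard sequence FTL is far from the best choice. A Universal Hedge algorithm would have to track this choice online, without access to the grid search used here.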

In this regard, it is interesting to note that in practical applications, the learning rates
chosen by sophisticated versions of Hedge do not always perform very well; higher learning
rates often do better. This is noted by Devaine et al. (2012), who resolve this issue by
adapting the learning rate sequentially in an ad-hoc fashion which works well in their
application, but for which they can provide no guarantees. A Universal Hedge algorithm
would adapt to the optimal learning rate with hindsight. FlipFlop is a first step in this
direction. Indeed, it already has some of the properties of such an ideal algorithm: under
some conditions we can show that if Hedge achieves bounded regret using any learning rate,
then FTL, and therefore FlipFlop, also achieves bounded regret:

Theorem 18 Fix any $\eta > 0$. For $K = 2$ experts with losses in $\{0, 1\}$ we have

\[ \mathcal{R}_T^{\text{hedge}(\eta)} \text{ is bounded} \implies \mathcal{R}_T^{\text{ftl}} \text{ is bounded} \implies \mathcal{R}_T^{\text{ff}} \text{ is bounded.} \]

The proof is in Appendix B. While the second implication remains valid for more experts
and other losses, we currently do not know if the first implication continues to hold as well.



6.2 The Big Picture 

Broadly speaking, a "learning rate" is any single scalar parameter controlling the relative
weight of the data and a prior regularization term in a learning task. Such learning rates pop
up in batch settings as diverse as $L_1$/$L_2$-regularized regression such as Lasso and Ridge,
standard Bayesian nonparametric and PAC-Bayesian inference (Zhang, 2006; Audibert,
2004; Catoni, 2007), and, as in this paper, in sequential prediction. In batch settings
one may sometimes set the learning rate by cross-validation, but this does not always come
with theoretical guarantees, and cannot easily be extended to the sequential prediction
setting. In a Bayesian approach, one can set the learning rate by treating it as just another
parameter, equipping it with a prior and marginalizing or determining the MAP value; it
is known, however, that this can fail dramatically if all the models under consideration are
wrong (Grünwald, 2012). All the applications just mentioned are similar in that they can
formally be seen as variants of Bayesian inference: Bayesian MAP in Lasso and Ridge,
randomized drawing from the posterior ("Gibbs sampling") in the PAC-Bayesian setting
and Hedge in the setting of this paper. An ideal method for adapting the learning rate
would work in all such cases. We currently have methods that are guaranteed to work for
a few special cases (see Table 2). It is encouraging that all these methods are based on
the same, apparently fundamental, quantity, the mixability gap as defined before Lemma 1:
they all employ different techniques to ensure a learning rate under which the posterior is
concentrated and hence the mixability gap is small. This gives some hope that the approach
can be taken even further.
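
The link between a concentrated posterior and a small mixability gap can be checked numerically. The sketch below assumes the single-round definitions used in this paper, namely Hedge loss $h = w \cdot \ell$, mix loss $m = -\frac{1}{\eta} \ln(w \cdot e^{-\eta \ell})$ and mixability gap $\delta = h - m$; the function name and the example weights are hypothetical.

```python
import math

def mixability_gap(weights, losses, eta):
    """Single-round mixability gap: Hedge loss minus mix loss."""
    hedge_loss = sum(w * l for w, l in zip(weights, losses))
    mix_loss = -math.log(
        sum(w * math.exp(-eta * l) for w, l in zip(weights, losses))
    ) / eta
    return hedge_loss - mix_loss

losses = [0.0, 1.0]   # hypothetical losses of two experts in one round
eta = 1.0
spread = mixability_gap([0.5, 0.5], losses, eta)            # uniform posterior
concentrated = mixability_gap([0.999, 0.001], losses, eta)  # nearly a point mass
print(spread, concentrated)
```

With these numbers the gap for the nearly concentrated weights is orders of magnitude smaller than for the uniform weights, and it vanishes for a point mass.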



| method | mode | complexity | setting | minimizes | competes with best $\eta$ in: | predicts/estimates |
|---|---|---|---|---|---|---|
| FlipFlop | sequential prediction | finite | worst-case | regret | $\eta \in \{\eta_t^{\mathrm{ah}}, \infty\}$ | averages |
| Safe two-part MDL (Grünwald, 2011) | batch | countably infinite | stochastic, i.i.d. | excess risk | $\eta \in B_2$ | point |
| Safe Bayes (Grünwald, 2012) | batch | completely arbitrary | stochastic, i.i.d. | excess risk | $\eta \in B_2$ | averages |

Table 2: Methods that compete with the best $\eta$ for special cases



In Table 2, "complexity" refers to the maximum number of actions/experts in the DTOL
setting of FlipFlop and the maximum number of predictors (e.g. classifiers, regression
functions) with prior support in the stochastic setting. In the stochastic setting we invariably
assume that data are of the form $(X_i, Y_i)$ and the goal is to predict $Y$ based on $X$. $B_2$
is defined as the set $\{1, 2^{-1}, 2^{-2}, \ldots\}$. The Safe Bayes and MDL algorithms may even be
capable of competing with the best $\eta \in (0, \infty)$. While we currently do not know whether
this is the case, we note that, in the stochastic setting, being able to compete with the best
$\eta \in B_2$ is already satisfactory: it implies that one can achieve minimax optimal risk
convergence rates in a variety of settings, e.g. if a Tsybakov margin condition holds
(Grünwald, 2012).



The safe two-part MDL estimator produces point estimates of the best available predictors;
analogously to FlipFlop, the safe Bayesian estimator averages all predictors according
to its posterior. The two "safe" algorithms can deal with arbitrary loss functions as long as
the loss is almost surely bounded. If the data are sampled from a distribution with bounded
support, this even holds if the loss function is itself unbounded.

All this suggests a major goal for future work: extending the worst-case approach of
this paper to the settings that are currently dealt with only in the stochastic case. First,
as already explained above, we would like to be able to compete with all $\eta$ in some set
that contains a whole range rather than just two values. Second, we would like to compete
with the best $\eta$ in a setting with a countably infinite number of experts equipped with an
arbitrary prior mass function $W_1$. Third, as an ultimate goal, we would like to develop a
method that can compete with the best $\eta$ with completely arbitrary sets of experts equipped
with some prior distribution $W$. The second and third goal require a slight modification of
the type of results in this paper: currently, our results all start with the basic identity and
bound (6), repeated here for convenience:

\[ \mathcal{R}_T = (M_T - L_T^*) + \Delta_T \le \frac{\ln K}{\eta} + \Delta_T. \]

For the case of infinitely many experts, this should be replaced by the following identity
and inequality, which hold simultaneously for all distributions $Q$ on the set of experts; the
idea is to choose $Q$ so as to get a useful bound.

\[ H_T - Q \cdot L_T = (M_T - Q \cdot L_T) + \Delta_T = \Big( \inf_{P} \Big\{ \frac{D(P \| W_1)}{\eta} + P \cdot L_T \Big\} - Q \cdot L_T \Big) + \Delta_T \le \frac{D(Q \| W_1)}{\eta} + \Delta_T, \tag{24} \]

where for convenience we defined $Q \cdot L_T := \mathbb{E}_{k \sim Q}[L_{T,k}]$, the expected value of the
cumulative loss under distribution $Q$. Here $W_1$ is a user-defined prior distribution on the
set of experts, analogous to our probability mass function $w_1$, and $D(\cdot \| \cdot)$ denotes the KL
divergence between two distributions on experts. The inequality is trivial; the equality is
a well-known result both in the sequential prediction and the PAC-Bayesian literature; see
e.g. Zhang (2006). To make (24) more concrete, consider a countable set of experts, fix an
expert $k$ and take $Q$ to be a point mass on $k$. Then $Q \cdot L_T = L_{T,k}$ and $D(Q \| W_1)$ becomes
equal to $-\ln w_{1,k}$, so (24) can be further rewritten as

\[ H_T - L_{T,k} \le \frac{-\ln w_{1,k}}{\eta} + \Delta_T. \]
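
As a sanity check on the two facts used here, the following Python sketch verifies on a small finite example that $Q \cdot L_T + D(Q \| W_1)/\eta$ upper-bounds the mix loss $M_T = -\frac{1}{\eta} \ln \mathbb{E}_{k \sim W_1}[e^{-\eta L_{T,k}}]$ for several choices of $Q$, with equality when $Q$ has mass proportional to $w_{1,k} e^{-\eta L_{T,k}}$, and that a point mass on $k$ gives the value $L_{T,k} - \ln(w_{1,k})/\eta$. The prior, the cumulative losses and all function names are hypothetical and chosen only for illustration.

```python
import math

def mix_loss(prior, cum, eta):
    """M_T = -(1/eta) ln E_{k~prior}[exp(-eta L_{T,k})]."""
    return -math.log(sum(w * math.exp(-eta * c) for w, c in zip(prior, cum))) / eta

def kl(q, p):
    """KL divergence D(q || p) between two distributions on the experts."""
    return sum(qk * math.log(qk / pk) for qk, pk in zip(q, p) if qk > 0)

def objective(q, prior, cum, eta):
    """Q . L_T + D(Q || W_1) / eta, the quantity that upper-bounds M_T."""
    return sum(qk * c for qk, c in zip(q, cum)) + kl(q, prior) / eta

def gibbs(prior, cum, eta):
    """Distribution with mass proportional to prior_k * exp(-eta L_{T,k})."""
    u = [w * math.exp(-eta * c) for w, c in zip(prior, cum)]
    z = sum(u)
    return [uk / z for uk in u]

prior = [0.5, 0.3, 0.2]   # hypothetical prior W_1
cum = [3.0, 1.0, 2.5]     # hypothetical cumulative losses L_T
eta = 0.7
M = mix_loss(prior, cum, eta)

# Equality is attained at the exponentially reweighted prior.
print(abs(objective(gibbs(prior, cum, eta), prior, cum, eta) - M) < 1e-12)
for k in range(3):
    point_mass = [1.0 if j == k else 0.0 for j in range(3)]
    # For a point mass the objective equals L_{T,k} - ln(w_{1,k}) / eta.
    print(k, objective(point_mass, prior, cum, eta) >= M)
```

The bound for a comparator $k$ becomes weaker as its prior weight $w_{1,k}$ decreases, which is the price paid for competing with experts that were considered unlikely a priori.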

We hope that using this bound, analogously to our use of (6) in the current paper, one can
prove bounds similar to those appearing in Theorems 8 and 14, with all occurrences of $L_T^*$
and $\ln K$ replaced by $L_{T,k}$ and $-\ln w_{1,k}$. Here $k$ can be thought of as a 'comparator' expert,
and the bounds should hold uniformly for all $k$ but get progressively weaker for $k$ with small
initial prior weight $w_{1,k}$. For the case of uncountable sets of experts, the hope is again to
prove results similar to Theorems 8 and 14, but now based on (24). Such results would give
strong worst-case performance bounds on huge, "nonparametric" sets of experts such as
Gaussian process models. Currently such worst-case bounds exist for the logarithmic loss
(Kakade et al., 2006), but not for any other loss function.

Appendix A. Proof of Lemma 1

The result for $\eta = \infty$ follows from $\eta < \infty$ as a limiting case, so we may assume without loss
of generality that $\eta < \infty$. Then $m_t \le h_t$ is obtained by using Jensen's inequality to move
the logarithm inside the expectation, and $m_t \ge 0$ and $h_t \le 1$ follow by bounding all losses
by their minimal and maximal values, respectively. The next two items are analogues of
similar basic results in Bayesian probability. Item 2 generalizes the chain rule of probability
$\Pr(x_1, \ldots, x_T) = \prod_{t=1}^T \Pr(x_t \mid x_1, \ldots, x_{t-1})$:

\[ M_T = -\frac{1}{\eta} \ln \prod_{t=1}^T \frac{w_1 \cdot e^{-\eta L_t}}{w_1 \cdot e^{-\eta L_{t-1}}} = -\frac{1}{\eta} \ln\big( w_1 \cdot e^{-\eta L_T} \big). \]

For the third item, use item 2 to write

\[ M_T = -\frac{1}{\eta} \ln \Big( \sum_k w_{1,k} \, e^{-\eta L_{T,k}} \Big). \]

The lower bound is obtained by bounding all $L_{T,k}$ from below by $L_T^*$; for the upper bound
we drop all terms in the sum except for the term corresponding to the best expert and use
$w_{1,k} = 1/K$.

For the last item, let $0 < \eta < \gamma$ be any two learning rates. Then Jensen's inequality
gives

\[ -\frac{1}{\eta} \ln\big( w_1 \cdot e^{-\eta L_T} \big) = -\frac{1}{\eta} \ln\big( w_1 \cdot (e^{-\gamma L_T})^{\eta/\gamma} \big) \ge -\frac{1}{\eta} \ln \big( w_1 \cdot e^{-\gamma L_T} \big)^{\eta/\gamma} = -\frac{1}{\gamma} \ln\big( w_1 \cdot e^{-\gamma L_T} \big). \]

This completes the proof. ■
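
The four items can also be checked numerically. The Python sketch below runs Hedge with a fixed learning rate and uniform initial weights on hypothetical random losses in $[0, 1]$ and tests each item in turn; the function names, the random seed and the learning rates are assumptions made only for this illustration.

```python
import math
import random

def run_hedge(losses, eta):
    """Per-round Hedge losses h_t, mix losses m_t and final cumulative losses."""
    cum = [0.0] * len(losses[0])
    hs, ms = [], []
    for loss in losses:
        best = min(cum)
        u = [math.exp(-eta * (c - best)) for c in cum]
        z = sum(u)
        w = [uk / z for uk in u]                       # Hedge weights w_t
        hs.append(sum(wk * lk for wk, lk in zip(w, loss)))
        ms.append(-math.log(sum(wk * math.exp(-eta * lk)
                                for wk, lk in zip(w, loss))) / eta)
        cum = [c + lk for c, lk in zip(cum, loss)]
    return hs, ms, cum

def closed_form_mix_loss(cum, eta):
    """-(1/eta) ln(w_1 . exp(-eta L_T)) for the uniform prior w_1."""
    best = min(cum)
    avg = sum(math.exp(-eta * (c - best)) for c in cum) / len(cum)
    return best - math.log(avg) / eta

rng = random.Random(0)  # hypothetical losses in [0, 1]
losses = [[rng.random() for _ in range(3)] for _ in range(200)]
eta = 0.5
hs, ms, cum = run_hedge(losses, eta)
M = sum(ms)
print(all(0 <= m <= h <= 1 for h, m in zip(hs, ms)))                     # item 1
print(abs(M - closed_form_mix_loss(cum, eta)) < 1e-9)                    # item 2
print(min(cum) <= M <= min(cum) + math.log(3) / eta)                     # item 3
print(closed_form_mix_loss(cum, eta) >= closed_form_mix_loss(cum, 2.0))  # item 4
```

Such a check is of course no substitute for the proof; it only confirms that the reconstructed statements are consistent on one data set.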




Appendix B. Proof of Theorem 18

Suppose that FTL has unbounded regret. We argue that Hedge with fixed $\eta$ must have
unbounded regret as well. First remove all trials where both experts suffer the same loss, as
these trials do not change the regret of either FTL or Hedge. Abbreviate $d_t = L_{t,2} - L_{t,1}$.
We say that a leader change happens at $t$ when $d_{t-1} d_{t+1} < 0$, that is, $d_t$ crosses zero at $t$.
Since FTL has unbounded regret, there are infinitely many leader changes.

We call a point-pair $(t, t+1)$ a local extremum if the losses in trials $t$ and $t+1$ are
opposite, i.e. $(d_{t+1} - d_t)(d_t - d_{t-1}) < 0$. Observe that a leader change cannot be a local
extremum. Over a local extremum, Hedge suffers loss $> 1$ but the best expert only suffers
loss 1. The regret of Hedge is hence decreased when the trials $t$ and $t+1$ are removed.
Iterated removal of local extrema leads to the $d_t$ sequence

\[ 0, +1, 0, -1, 0, +1, 0, -1, \ldots \]

The regret of Hedge on this sequence is linear in $t$. To see this, observe that over one period
the loss of the best expert increases by 2, while the loss of Hedge increases by

\[ 2 \cdot \frac{1}{2} + 2 \cdot \frac{1}{1 + e^{-\eta}} > 2. \]

Hence the Hedge regret is unbounded on the original loss sequence. ■
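
The per-period computation in the last step can be reproduced with a short simulation. The Python sketch below runs Hedge with fixed $\eta$ and uniform initial weights on the reduced sequence $d_t = 0, +1, 0, -1, \ldots$ and compares its regret after $n$ periods with $n \big( 1 + \frac{2}{1 + e^{-\eta}} - 2 \big) = n \tanh(\eta/2)$, which follows from the displayed per-period loss. The function name and the chosen values of $\eta$ and $n$ are hypothetical.

```python
import math

def hedge_regret_on_periodic_sequence(eta, periods):
    """Regret of Hedge(eta) on the two-expert sequence d_t = 0, +1, 0, -1, ..."""
    # Losses (expert 1, expert 2) over one period; d_t = L_{t,2} - L_{t,1}
    # moves 0 -> +1 -> 0 -> -1 -> 0.
    period = [(0.0, 1.0), (1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    cum = [0.0, 0.0]
    hedge_loss = 0.0
    for _ in range(periods):
        for loss in period:
            best = min(cum)
            u = [math.exp(-eta * (c - best)) for c in cum]
            z = sum(u)
            hedge_loss += sum(uk * lk for uk, lk in zip(u, loss)) / z
            cum = [c + lk for c, lk in zip(cum, loss)]
    return hedge_loss - min(cum)

eta, periods = 1.0, 10
simulated = hedge_regret_on_periodic_sequence(eta, periods)
predicted = periods * math.tanh(eta / 2)
print(simulated, predicted)
```

The simulated regret grows linearly in the number of periods for every fixed $\eta > 0$, in agreement with the argument above.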